Matching data from unequal length data frames in R

This seems like it should be really simple. I have two data frames of unequal length in R. One is simply a random subset of the larger data set, so both contain exactly the same data and share an identical UniqueID. What I would like to do is put an indicator, say a 0 or 1, in the larger data set that says whether this row is in the smaller data set.
I can use which(long$UniqID %in% short$UniqID), but I can't seem to figure out how to match this indicator back to the long data set.

Made some sample data.
long<-data.frame(UniqID=sample(letters[1:20],20))
short<-data.frame(UniqID=sample(letters[1:20],10))
You can use %in% without which() to get TRUE and FALSE values, and then convert them to 0 and 1 with as.numeric().
long$sh<-as.numeric(long$UniqID %in% short$UniqID)
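For readers who already have the which() indices from the question, here is a minimal sketch of how they could be mapped back onto the long data frame, compared with the direct %in% approach. The column names from_which and from_in and the seed are only illustrative assumptions.

```r
set.seed(42)  # arbitrary seed, only for reproducibility
long  <- data.frame(UniqID = sample(letters[1:20], 20))
short <- data.frame(UniqID = sample(letters[1:20], 10))

# Route 1: start from the which() indices used in the question
idx <- which(long$UniqID %in% short$UniqID)
long$from_which <- 0          # default: not in the short data frame
long$from_which[idx] <- 1     # flag the matching rows

# Route 2: convert the logical vector directly
long$from_in <- as.numeric(long$UniqID %in% short$UniqID)

identical(long$from_which, long$from_in)  # TRUE
```

Both routes give the same column; the second simply skips the intermediate index vector.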

I'll use @AnandaMahto's data to illustrate another way, using duplicated, which works whether or not you have a unique ID.
Case 1: Has unique id column
set.seed(1)
df1 <- data.frame(ID = 1:10, A = rnorm(10), B = rnorm(10))
df2 <- df1[sample(10, 4), ]
transform(df1, indicator = 1 * duplicated(rbind(df2, df1)[, "ID", drop = FALSE])[-seq_len(nrow(df2))])
Case 2: Has no unique id column
set.seed(1)
df1 <- data.frame(A = rnorm(10), B = rnorm(10))
df2 <- df1[sample(10, 4), ]
transform(df1, indicator = 1 * duplicated(rbind(df2, df1))[-seq_len(nrow(df2))])
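The one-liner above packs several steps together. As a rough sketch of what it seems to be doing, the same logic can be unrolled as follows; the intermediate names stacked, dup and indicator are introduced here purely for illustration.

```r
set.seed(1)
df1 <- data.frame(A = rnorm(10), B = rnorm(10))
df2 <- df1[sample(10, 4), ]

# Put the subset first, so that any row of df1 that also occurs in df2
# is seen as a duplicate of an earlier row
stacked <- rbind(df2, df1)
dup <- duplicated(stacked)

# Drop the flags that belong to the df2 rows at the top of the stack
indicator <- as.integer(dup[-seq_len(nrow(df2))])

df1$indicator <- indicator
sum(df1$indicator)  # 4, one flag per sampled row
```

This relies on whole rows being compared, so it assumes rows of df1 are distinct from each other; repeated rows in df1 would be flagged even if they are not in df2.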

The answers so far are good. However, a question was raised: what if there wasn't a "UniqID" column?
At that point, perhaps merge can be of assistance:
Here's an example using merge and %in% where an ID is available:
set.seed(1)
df1 <- data.frame(ID = 1:10, A = rnorm(10), B = rnorm(10))
df2 <- df1[sample(10, 4), ]
temp <- merge(df1, df2, by = "ID")$ID
df1$matches <- as.integer(df1$ID %in% temp)
And, a similar example where an ID isn't available.
set.seed(1)
df1_NoID <- data.frame(A = rnorm(10), B = rnorm(10))
df2_NoID <- df1_NoID[sample(10, 4), ]
temp <- merge(df1_NoID, df2_NoID, by = "row.names")$Row.names
df1_NoID$matches <- as.integer(rownames(df1_NoID) %in% temp)

You can directly use the logical vector as a new column:
long$Indicator <- 1*(long$UniqID %in% short$UniqID)

See if this can get you started:
long <- data.frame(UniqID=sample(1:100)) #creating a long data frame
short <- data.frame(UniqID=long[sample(1:100, 30), ]) #creating a short one with the same ids.
long$indicator <- long$UniqID %in% short$UniqID #creating an indicator column in long.
> head(long)
UniqID indicator
1 87 TRUE
2 15 TRUE
3 100 TRUE
4 40 FALSE
5 89 FALSE
6 21 FALSE

Related

How can I create a function to generate new variables based on values in different dataframe in R

I would like to create a function like this (obviously not proper code):
forEach ID in DATAFRAME1 look at each row with ID in DATAFRAME2 {
if DATAFRAME2$VARIABLE1 = something {
DATAFRAME1$VARIABLE1 = TRUE;
DATAFRAME1$VARIABLE2 = DATAFRAME2$VARIABLE2
}
}
In plain text, I've got a list of individuals and a database with mixed information on these individuals. Let's say DATAFRAME2 contains information on books read, c(id, title, author, date). I want to create a new variable in DATAFRAME1 with a boolean for whether the individual has read a specific book (VARIABLE1 above) and the date they first read it (VARIABLE2 above). Adding a third variable with the number of times read would also be interesting, but is not necessary.
I haven't really done this in R before, mostly doing basic statistics and basic wrangling with dplyr. I guess I could use dplyr and join but this feels like a better approach. Any help to get me started would be much appreciated.
The following function does what the question asks for. Its arguments are:
DF1 and DF2, which have an obvious meaning;
ID, the name of the id column shared by both data frames (default 'ID');
var1 and var2, which are VARIABLE1 and VARIABLE2 in the question;
value, which is the value of something.
The test data is at the end.
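fun <- function(DF1, DF2, ID = 'ID', var1, var2, value){
  # Start with both new columns empty
  DF1[[var1]] <- NA
  DF1[[var2]] <- NA
  # Rows of DF2 where var1 equals the value of interest
  k <- DF2[[var1]] == value
  for(id in DF1[[ID]]){  # loop over the argument DF1, not the global df1
    i <- DF1[[ID]] == id
    j <- DF2[[ID]] == id
    if(any(j & k)){
      DF1[[var1]][i] <- TRUE
      DF1[[var2]][i] <- DF2[[var2]][j & k]
    }
  }
  DF1
}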
fun(df1, df2, value = 4, var1 = 'X', var2 = 'Y')
# ID X Y
#1 a NA NA
#2 d TRUE 19
Test data.
set.seed(1234)
df1 <- data.frame(ID = c("a", "d"))
df2 <- data.frame(ID = rep(letters[1:5], 4),
X = sample(20, 20, TRUE),
Y = sample(20))
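The question also asks about the first date read and the number of times read. A loop-free sketch in base R might look like the following. The data frames people and books, the title "Dune" and the new column names are all made up for illustration, and the sketch assumes the ids in people are unique.

```r
# Hypothetical data: one row per individual, one row per reading event
people <- data.frame(id = c(1, 2, 3))
books <- data.frame(
  id    = c(1, 1, 2, 1),
  title = c("Dune", "Emma", "Emma", "Dune"),
  date  = as.Date(c("2020-03-01", "2019-05-10", "2021-01-15", "2018-07-04")),
  stringsAsFactors = FALSE
)
target <- "Dune"  # the specific book of interest

# Keep only readings of the target book, earliest first
hits <- books[books$title == target, ]
hits <- hits[order(hits$date), ]

# After sorting, the first occurrence of each id is the earliest reading
first <- hits[!duplicated(hits$id), c("id", "date")]

# Look up each individual in the table of first readings
m <- match(people$id, first$id)
people$has_read   <- !is.na(m)
people$first_read <- first$date[m]

# Count readings per individual, including zero counts
people$times_read <- as.integer(table(factor(hits$id, levels = people$id)))

people
```

Individuals without a matching reading get FALSE, an NA date and a count of 0, rather than being dropped as an inner join would do.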

Filter dataframe by vector of column names and constant column names

This is surely easy but for the life of me I can't find the right syntax.
I want to keep all "ID_" columns, regardless of the number of columns and the numbers attached, and keep other columns by constant name.
Something like the below command that doesn't work (on the recreated data, every time):
###Does not work, but shows what I am trying to do
testdf1 <- df1[,c(paste(idvec, collapse="','"),"ConstantNames_YESwant")]
Recreated data:
rand <- sample(1:2, 1)
if(rand==1){
df1 <- data.frame(
ID_0=0,
ID_1=1,
ID_2=11,
ID_3=111,
LotsOfColumnsWithVariousNames_NOwant="unwanted_data",
ConstantNames_YESwant="wanted_data",
stringsAsFactors = FALSE
)
desired.df1 <- data.frame(
ID_0=0,
ID_1=1,
ID_2=11,
ID_3=111,
ConstantNames_YESwant="wanted_data",
stringsAsFactors = FALSE
)
}
if(rand==2){
df1 <- data.frame(
ID_0=0,
ID_1=1,
LotsOfColumnsWithVariousNames_NOwant="unwanted_data",
ConstantNames_YESwant="wanted_data",
stringsAsFactors = FALSE
)
desired.df1 <- data.frame(
ID_0=0,
ID_1=1,
ConstantNames_YESwant="wanted_data",
stringsAsFactors = FALSE
)
}
Is this what you want?
library(tidyverse)
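# Either select by regular expression, anchored so only names starting with "ID_" match
df1 %>%
  select(matches("^ID_"), ConstantNames_YESwant)
# or select by prefix
df1 %>%
  select(starts_with("ID_"), ConstantNames_YESwant)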
# ID_0 ID_1 ConstantNames_YESwant
# 1 0 1 wanted_data
In base R, you could do:
#Get all the ID columns
idvec <- grep("ID", colnames(df1), value = TRUE)
#Select ID columns and the constant names you want.
df1[c(idvec, "ConstantNames_YESwant")]
# ID_0 ID_1 ConstantNames_YESwant
#1 0 1 wanted_data
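One caveat worth sketching: an unanchored pattern such as "ID" matches that text anywhere in a column name. The example below adds a hypothetical column VALID_flag (not part of the original data) to show how a stricter pattern could avoid picking it up.

```r
# VALID_flag is a made-up column that happens to contain the letters "ID"
df1 <- data.frame(
  ID_0 = 0,
  ID_1 = 1,
  VALID_flag = "x",
  ConstantNames_YESwant = "wanted_data",
  stringsAsFactors = FALSE
)

# Unanchored: also picks up VALID_flag
loose <- grep("ID", names(df1), value = TRUE)

# Anchored: only names of the form ID_<digits>
id_cols <- grep("^ID_[0-9]+$", names(df1), value = TRUE)

kept <- df1[c(id_cols, "ConstantNames_YESwant")]
names(kept)
```

Whether the stricter pattern is needed depends on how the other column names are formed in the real data.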

adding unique rows from one data frame to another

I have a data frame which comprises a subset of records contained in a 2nd data frame. I would like to add the record rows of the 2nd data frame that are not common in the first data frame to the first... Thank you.
If you want all unique rows from both dataframes, this would work:
df1 <- data.frame(X = c('A','B','C'), Y = c(1,2,3))
df2 <- data.frame(X = 'A', Y = 1)
df <- rbind(df1,df2)
no.dupes <- df[!duplicated(df),]
no.dupes
# X Y
#1 A 1
#2 B 2
#3 C 3
But it won't work if there are duplicate rows in either data frame that you want to preserve.
You should look at dplyr's distinct() and bind_rows() functions.
Better still, provide some dummy data to work on and the expected output.
Suppose you have two data frames, a and b, and you want to add the unique rows of a to b:
a = data.frame(
x = c(1,2,3,1,4,3),
y = c(5,2,3,5,3,3)
)
b = data.frame(
x = c(6,2,2,3,3),
y = c(19,13,12,3,1)
)
library(dplyr)
distinct(a) %>% bind_rows(.,b)
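If the goal is to append only those rows of the second data frame that are absent from the first, while leaving any repeated rows in the first untouched, a base R sketch could build a key per row and filter on it. The data and the names key1, key2 and combined are assumptions made for this example.

```r
# Hypothetical data: df1 has a repeated row that should be preserved
df1 <- data.frame(X = c("A", "B", "B"), Y = c(1, 2, 2), stringsAsFactors = FALSE)
df2 <- data.frame(X = c("A", "C"), Y = c(1, 3), stringsAsFactors = FALSE)

# Collapse each row into a single string key
# (the separator is assumed not to occur in the data)
key1 <- do.call(paste, c(df1, sep = "\r"))
key2 <- do.call(paste, c(df2, sep = "\r"))

# Append only the rows of df2 whose key is not already in df1
combined <- rbind(df1, df2[!key2 %in% key1, ])
combined
```

This assumes both data frames have the same columns in the same order; dplyr's anti_join() followed by bind_rows() would be the equivalent in that package.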

R applying a data frame on another data frame

I have two data frames.
set.seed(1234)
df <- data.frame(
id = factor(rep(1:24, each = 10)),
price = runif(20)*100,
quantity = sample(1:100,240, replace = T)
)
df2 <- data.frame(
id = factor(seq(1:24)),
eq.quantity = sample(1:100, 24, replace = T)
)
I would like to use df2$eq.quantity to find the row of df whose quantity is closest to it (in absolute difference), within each level of the factor variable id. I would like to do that for each id in df2 and bind the results into a new data frame called results.
I can do it like this for each ID individually:
d.1 <- df2[df2$id == 1, 2]
df.1 <- subset(df, id == 1)
id.1 <- df.1[which.min(abs(df.1$quantity-d.1)),]
Which would give the solution:
id price quantity
1 66.60838 84
But I would really like a smarter solution that also gathers the results into a data frame. If I did it manually, it would look something like this:
results <- cbind(id.1, id.2, etc..., id.24)
I had some trouble giving this question a good name?
data.tables are smart!
Adding this to your current example...
library(data.table)
dt = data.table(df)
dt2 = data.table(df2)
setkey(dt, id)
setkey(dt2, id)
dt[dt2, dif:=abs(quantity - eq.quantity)]
dt[,list(price=price[which.min(dif)], quantity=quantity[which.min(dif)]), by=id]
result:
dt[,list(price=price[which.min(dif)], quantity=quantity[which.min(dif)]), by=id]
id price quantity
1: 1 66.6083758 84
2: 2 29.2315840 19
3: 3 62.3379442 63
4: 4 54.4974836 31
5: 5 66.6083758 6
6: 6 69.3591292 13
...
Merge the two datasets and use lapply to perform the function on each id.
df3 <- merge(df,df2,all.x=TRUE,by="id")
diffvar <- function(df){
df4 <- subset(df3, id == df)
df4[which.min(abs(df4$quantity-df4$eq.quantity)),]
}
resultslist <- lapply(levels(df3$id),function(df) diffvar(df))
Combine the resulting list elements in a dataframe:
resultsdf <- data.frame(matrix(unlist(resultslist), ncol=4, byrow=T))
Or, more simply:
library(plyr)
resultsdf <- ddply(df3, .(id), function(x)x[which.min(abs(x$quantity-x$eq.quantity)),])
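For completeness, the same merge-then-pick idea can be sketched in base R alone with split() and do.call(rbind, ...), which keeps the column names and types of the merged data. The names merged, pieces, closest and results are chosen for this sketch only.

```r
set.seed(1234)
df <- data.frame(
  id = factor(rep(1:24, each = 10)),
  price = runif(20) * 100,
  quantity = sample(1:100, 240, replace = TRUE)
)
df2 <- data.frame(
  id = factor(1:24),
  eq.quantity = sample(1:100, 24, replace = TRUE)
)

# Attach eq.quantity to every row of df
merged <- merge(df, df2, by = "id")

# One data frame per id, then keep the row with the smallest absolute gap
pieces  <- split(merged, merged$id)
closest <- lapply(pieces, function(g) {
  g[which.min(abs(g$quantity - g$eq.quantity)), ]
})

# Stack the one-row data frames back together
results <- do.call(rbind, closest)
head(results)
```

Note that which.min() returns the first minimum, so ties within an id are resolved in favour of the earlier row.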

Extract factor column from data frame

My data frame is breaking when I extract some rows from a factor column:
data.df = data.frame(x = factor(letters[1:10]))
data.temp = data.df[1:3, ]
print(data.temp)
How can I avoid that? I need the column name to be kept as well. Thanks!
You can add the argument drop=FALSE to keep the data as a data frame.
data.df = data.frame(x = factor(letters[1:10]))
data.temp = data.df[1:3, ,drop=FALSE]
print(data.temp)
x
1 a
2 b
3 c
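To make the difference concrete, here is a small sketch that checks what each form of subsetting returns; the names as_vector and as_frame are just for illustration.

```r
data.df <- data.frame(x = factor(letters[1:10]))

# Default behaviour: a single-column result is dropped to a plain factor
as_vector <- data.df[1:3, ]

# With drop = FALSE the result stays a one-column data frame
as_frame <- data.df[1:3, , drop = FALSE]

is.data.frame(as_vector)  # FALSE
is.data.frame(as_frame)   # TRUE
names(as_frame)           # "x"
```

The same dropping happens whenever a data frame has only one column left after subsetting, so drop = FALSE is a reasonable habit in code that must always return a data frame.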
