How can I create a function to generate new variables based on values in a different dataframe in R?

I would like to create a function like this (obviously not proper code):
forEach ID in DATAFRAME1, look at each row with that ID in DATAFRAME2 {
    if DATAFRAME2$VARIABLE1 = something {
        DATAFRAME1$VARIABLE1 = TRUE;
        DATAFRAME1$VARIABLE2 = DATAFRAME2$VARIABLE2
    }
}
In plain text: I've got a list of individuals and a database with mixed information on these
individuals. Let's say DATAFRAME2 contains information on books read, c(id, title, author, date). I want to create a new variable in DATAFRAME1 with a boolean indicating whether the individual has read a specific book (VARIABLE1 above) and the date they first read it (VARIABLE2 above). Adding a third variable with the number of times read would also be interesting, but is not necessary.
I haven't really done this in R before; I've mostly been doing basic statistics and basic wrangling with dplyr. I guess I could use dplyr and a join, but writing a function feels like a better approach. Any help to get me started would be much appreciated.

The following function does what the question asks for. Its arguments are:
DF1 and DF2, which have the obvious meaning;
var1 and var2, which are VARIABLE1 and VARIABLE2 in the question;
value, which is the value of something.
The test data is at the end.
fun <- function(DF1, DF2, ID = 'ID', var1, var2, value){
  DF1[[var1]] <- NA
  DF1[[var2]] <- NA
  k <- DF2[[var1]] == value              # rows of DF2 where var1 equals 'value'
  for(id in DF1[[ID]]){
    i <- DF1[[ID]] == id                 # row of DF1 for this individual
    j <- DF2[[ID]] == id                 # rows of DF2 for this individual
    if(any(j & k)){
      DF1[[var1]][i] <- TRUE
      DF1[[var2]][i] <- DF2[[var2]][j & k]   # assumes a single matching row per ID
    }
  }
  DF1
}
fun(df1, df2, value = 4, var1 = 'X', var2 = 'Y')
# ID X Y
#1 a NA NA
#2 d TRUE 19
Test data.
set.seed(1234)
df1 <- data.frame(ID = c("a", "d"))
df2 <- data.frame(ID = rep(letters[1:5], 4),
                  X = sample(20, 20, TRUE),
                  Y = sample(20))
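Since the question mentions dplyr, a join-based sketch could look like this (just a sketch, assuming the same df1/df2 and value = 4; min(Y) stands in for "date first read", and times_read covers the optional count):
library(dplyr)
hits <- df2 %>%
  filter(X == 4) %>%                 # rows of df2 where the variable equals the value of interest
  group_by(ID) %>%
  summarise(X = TRUE,                # the boolean flag
            Y = min(Y),              # earliest matching value per ID
            times_read = n())        # how many matching rows per ID
df1 %>% left_join(hits, by = "ID")   # IDs without a match get NA in the new columns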

Related

R: insert rows at specific places in dataframe

I can't seem to find an example to help me solve a particular problem in R. I have a data frame that looks like this:
tmp = data.frame(group = c(rep("A", 5), rep("B",2), rep("C",6)), value = rnorm(13))
In reality I have thousands of columns and rows with many different values for group. The rows in the data frame are ordered by group.
I'd like to insert a new row above the first occurrence of each group. I'd also like for these new rows to only contain a value (the same value) in the first column (although I can make do if columns 2:ncol(tmp) contain NAs). Using the example data frame above, the end result should look like this:
group value
GROUP
A -1.7596279
A -0.8273928
A -0.3515738
A -0.7547999
A 0.5700747
GROUP
B -1.9676482
B 0.3996858
GROUP
C 0.1047832
C 0.5903711
C -1.3687259
C 0.3688415
C 1.3674403
C 0.8880089
Is there a way to do this? I can come up with a list of rows containing the first instance of each group. I was originally thinking that I could use this information to define where new rows should be inserted, but not sure if this is the best way to go.
I tried to create a function that does what you want it to do:
addEmptyRows <- function(D)
{
  output <- D                      # work on the argument, not the global tmp
  i <- 1
  while (i < NROW(output)) {
    if (output$group[i] != output$group[i+1])
    {
      # insert a divider row before the first row of the next group
      output <- rbind(output[1:i, ], c("GROUP", NA), output[(i+1):NROW(output), ])
      i <- i + 1                   # skip over the row just inserted
    }
    i <- i + 1
  }
  return(rbind(c("GROUP", NA), output))   # divider before the very first group
}
If you apply this function to your dataframe:
addEmptyRows(tmp)
It gives you the desired dataframe. Does this help you?
You could use something like this:
tmp <- data.frame(group = c(rep("A", 5), rep("B", 2), rep("C", 6)), value = rnorm(13))
divider <- data.frame(group = "GROUP", value = NA)
do.call(rbind, unlist(lapply(split(tmp, tmp$group),
                             function(x) list(divider, x)), recursive = FALSE))
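A tidyverse variant of the same split-and-stack idea, just as a sketch (assumes the dplyr and purrr packages; the mutate() guards against group being a factor):
library(dplyr)
library(purrr)
tmp %>%
  mutate(group = as.character(group)) %>%   # avoid factor/character binding issues
  group_split(group) %>%                    # one data frame per group
  map(~ bind_rows(divider, .x)) %>%         # prepend the divider row to each piece
  bind_rows()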

Compare two data.frames to find similar values from data.frame 1 in data.frame 2

I have two data.frames: name and searches
name <- data.frame(
  A = c("example", "firstly", "second.com"))
searches <- data.frame(
  A = c("example.com", "secondly", "first"),
  B = c("test", "test.com", "test1"))
I want to search in data.frame "searches" for the values in data.frame "name". If there is a similar value (not exactly the same) I want R to return the value from name and from searches in a new row in a new table.
So a new data.frame could be
result <- data.frame(
  A = c("example", "firstly", "second.com"),
  B = c("example.com", "first", "secondly"),
  C = c("test", "test1", "test.com"))
Is that possible?
You can use the stringr package in R to do this. For example, if you have
name <- data.frame(
A = c("example", "firstly", "second.com"))
searches <- data.frame(
A = c("example.com","secondly","first"),
B = c("test", "test.com", "test1"))
then you can use
library(stringr)
str_extract(searches$A, '.*example.*')
This gives an output of
> str_extract(searches$A, '.*example.*')
[1] "example.com" NA NA
If you set this up with an appropriate for loop to iterate over elements in your name dataframe and cells of your searches dataframe then you could pick up all matches and extract them as desired.
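A rough sketch of that loop (it only picks up cases where a name string occurs verbatim inside a searches string, so a pair like "second.com" / "secondly" would still need a fuzzier method such as the fuzzyjoin answer below):
library(stringr)
result <- data.frame()
for (nm in as.character(name$A)) {
  # TRUE where the current name occurs inside a searches entry
  # (as.character() in case the columns are factors)
  found <- str_detect(as.character(searches$A), fixed(nm))
  if (any(found)) {
    result <- rbind(result,
                    data.frame(A = nm,
                               B = searches$A[found],
                               C = searches$B[found]))
  }
}
result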
Use the stringdist_join function from the fuzzyjoin package.
library(fuzzyjoin)
name <- data.frame(
A = c("example", "firstly", "second.com")
)
searches <- data.frame(
A = c("example.com","secondly","first"),
B = c("test", "test.com", "test1")
)
result <- stringdist_join(name, searches, by = "A", max_dist = 5)
Which results in:
> print(result)
A.x A.y B
1 example example.com test
2 firstly first test1
3 second.com secondly test.com

Case usage in R: count the number of events from Table 2 when a case in Table 1 satisfies specific restrictions

The DF for Table 1 is like this:
df1 <- data.frame(ID = c('001', '001', '002', '003', '003', '003'),
                  date = c('2015-05-23', '2015-07-29', '2015-08-08', '2015-06-10', '2015-10-12', '2015-11-15'),
                  date_last = c('2015-01-20', '2015-05-23', '2015-05-15', '2015-01-20', '2015-06-10', '2015-10-12'))
And the DF for Table 2 is like this:
df2 <- data.frame(Event = c('A', 'B', 'C', 'D', 'E'),
                  Event_date = c('2015-01-21', '2015-01-21', '2015-03-29', '2015-08-12', '2015-10-12'))
What I want is, for each case, to count an event when df1$date_last < df2$Event_date < df1$date, and then sum up how many events fall within that time period. The ideal result would look like the following:
df3 <- data.frame(ID = c('001', '001', '002', '003', '003', '003'),
                  date = c('2015-05-23', '2015-07-29', '2015-02-08', '2015-06-10', '2015-10-12', '2015-11-15'),
                  date_last = c('2015-01-20', '2015-05-23', '2015-05-15', '2015-01-20', '2015-06-10', '2015-10-12'),
                  number_of_events = c(3, 1, 0, 3, 1, 0))
Anyone know the R code for this? Thank you so much!
Make sure that all your dates are of class Date. You can simply do this by wrapping the columns in as.Date() when creating the data frames.
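For example (a sketch, using the column names from the question's df1 and df2):
df1$date       <- as.Date(df1$date)
df1$date_last  <- as.Date(df1$date_last)
df2$Event_date <- as.Date(df2$Event_date)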
First, define a function where x is a vector holding the end and start date (in that order) and y is a vector of dates to be checked.
# x = c(date, date_last) for one row of df1; y = the vector of event dates
nr_events_in_between <- function(x, y) sum(x[2] < y & x[1] > y)
Apply this to all rows in df1 and you get the number_of_events column.
apply(df1[ ,c('date', 'date_last')], 1, nr_events_in_between, df2[,'Event_date'])
(Note that for the second row the value is 0, not 1 as stated in the example df3.)
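To attach the counts as a column matching the layout of the question's df3 (a sketch, reusing the function and data above):
df1$number_of_events <- apply(df1[, c('date', 'date_last')], 1,
                              nr_events_in_between, df2$Event_date)
df1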

R applying a data frame on another data frame

I have two data frames.
set.seed(1234)
df <- data.frame(
  id = factor(rep(1:24, each = 10)),
  price = runif(20) * 100,
  quantity = sample(1:100, 240, replace = T)
)
df2 <- data.frame(
  id = factor(seq(1:24)),
  eq.quantity = sample(1:100, 24, replace = T)
)
I would like to use df2$eq.quantity to find the closest absolute value compared to df$quantity, by the factor variable id. I would like to do that for each id in df2 and bind the results into a new data frame, called results.
I can do it like this for each ID individually:
d.1 <- df2[df2$id == 1, 2]
df.1 <- subset(df, id == 1)
id.1 <- df.1[which.min(abs(df.1$quantity-d.1)),]
Which would give the solution:
id price quantity
1 66.60838 84
But I would really like to use a smarter solution, and also to gather the results into a dataframe, so if I did it manually it would look kind of like this:
results <- cbind(id.1, id.2, etc..., id.24)
I had some trouble giving this question a good title.
data.tables are smart!
Adding this to your current example...
library(data.table)
dt = data.table(df)
dt2 = data.table(df2)
setkey(dt, id)
setkey(dt2, id)
dt[dt2, dif:=abs(quantity - eq.quantity)]
dt[,list(price=price[which.min(dif)], quantity=quantity[which.min(dif)]), by=id]
result:
dt[,list(price=price[which.min(dif)], quantity=quantity[which.min(dif)]), by=id]
id price quantity
1: 1 66.6083758 84
2: 2 29.2315840 19
3: 3 62.3379442 63
4: 4 54.4974836 31
5: 5 66.6083758 6
6: 6 69.3591292 13
...
Merge the two datasets and use lapply to perform the function on each id.
df3 <- merge(df,df2,all.x=TRUE,by="id")
diffvar <- function(i){
  # keep, for one id, the row with the smallest absolute difference
  # between quantity and eq.quantity
  df4 <- subset(df3, id == i)
  df4[which.min(abs(df4$quantity - df4$eq.quantity)), ]
}
resultslist <- lapply(levels(df3$id), diffvar)
Combine the resulting list elements in a dataframe:
resultsdf <- data.frame(matrix(unlist(resultslist), ncol=4, byrow=T))
Or, more easily:
library(plyr)
resultsdf <- ddply(df3, .(id), function(x)x[which.min(abs(x$quantity-x$eq.quantity)),])
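A dplyr equivalent of the same idea, as a sketch (assumes the merged df3 from above):
library(dplyr)
results <- df3 %>%
  group_by(id) %>%
  slice(which.min(abs(quantity - eq.quantity))) %>%
  ungroup()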

Matching data from unequal length data frames in r

This seems like it should be really simple. I've got 2 data frames of unequal length in R; one is simply a random subset of the larger data set. Therefore, they have exactly the same data and a UniqueID that is exactly the same. What I would like to do is put an indicator, say a 0 or 1, in the larger data set that says this row is in the smaller data set.
I can use which(long$UniqID %in% short$UniqID), but I can't seem to figure out how to match this indicator back to the long data set.
I made some sample data.
long<-data.frame(UniqID=sample(letters[1:20],20))
short<-data.frame(UniqID=sample(letters[1:20],10))
You can use %in% without which() to get values TRUE and FALSE and then with as.numeric() convert them to 0 and 1.
long$sh<-as.numeric(long$UniqID %in% short$UniqID)
I'll use @AnandaMahto's data to illustrate another way using duplicated, which works whether or not you have a unique ID column.
Case 1: Has unique id column
set.seed(1)
df1 <- data.frame(ID = 1:10, A = rnorm(10), B = rnorm(10))
df2 <- df1[sample(10, 4), ]
transform(df1, indicator = 1 * duplicated(rbind(df2, df1)[, "ID", drop = FALSE])[-seq_len(nrow(df2))])
Case 2: Has no unique id column
set.seed(1)
df1 <- data.frame(A = rnorm(10), B = rnorm(10))
df2 <- df1[sample(10, 4), ]
transform(df1, indicator = 1 * duplicated(rbind(df2, df1))[-seq_len(nrow(df2))])
The answers so far are good. However, a question was raised: what if there isn't a "UniqID" column?
At that point, perhaps merge can be of assistance:
Here's an example using merge and %in% where an ID is available:
set.seed(1)
df1 <- data.frame(ID = 1:10, A = rnorm(10), B = rnorm(10))
df2 <- df1[sample(10, 4), ]
temp <- merge(df1, df2, by = "ID")$ID
df1$matches <- as.integer(df1$ID %in% temp)
And, a similar example where an ID isn't available.
set.seed(1)
df1_NoID <- data.frame(A = rnorm(10), B = rnorm(10))
df2_NoID <- df1_NoID[sample(10, 4), ]
temp <- merge(df1_NoID, df2_NoID, by = "row.names")$Row.names
df1_NoID$matches <- as.integer(rownames(df1_NoID) %in% temp)
You can directly use the logical vector as a new column:
long$Indicator <- 1*(long$UniqID %in% short$UniqID)
See if this can get you started:
long <- data.frame(UniqID=sample(1:100)) #creating a long data frame
short <- data.frame(UniqID=long[sample(1:100, 30), ]) #creating a short one with the same ids.
long$indicator <- long$UniqID %in% short$UniqID #creating an indicator column in long.
> head(long)
UniqID indicator
1 87 TRUE
2 15 TRUE
3 100 TRUE
4 40 FALSE
5 89 FALSE
6 21 FALSE
