I have a spreadsheet Y that has thousands of rows.
I would like to extract a few hundred specific rows identified by the keys in X.
I used the intersect function to generate object Z as below, but I do not know how to proceed.
Many thanks
Z<-intersect(X$PatientID, Y$Patient.ID)
Try using %in%:
Sample Data
Y <- data.frame(PatientID = LETTERS,
Var1 = 1:26)
key <- data.frame(PatientID = c("A", "W", "V"),
Var1 = c(1, 22:23))
Using %in%:
want <- Y[Y$PatientID %in% key$PatientID, ]
Output:
# PatientID Var1
# 1 A 1
# 22 V 22
# 23 W 23
Note if you needed to use intersect, you would just do this:
Z <- intersect(Y$PatientID, key$PatientID)
want <- Y[Y$PatientID %in% Z,]
And it would give you the same output
You can merge (this assumes the key column is named PatientID in both data frames):
base R:
merge(Y, unique(X[, "PatientID", drop = FALSE]))
dplyr:
dplyr::inner_join(Y, dplyr::distinct(X, PatientID))
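In the question the key column is called PatientID in X but Patient.ID in Y. A minimal sketch of the same merge under that assumption, using merge's by.x and by.y arguments (the sample data here is made up for illustration):

```r
# Made-up sample data: the key column has a different name in each data frame.
Y <- data.frame(Patient.ID = LETTERS, Var1 = 1:26, stringsAsFactors = FALSE)
X <- data.frame(PatientID = c("A", "W", "V", "A"), stringsAsFactors = FALSE)

# De-duplicate the keys first so repeated IDs in X do not duplicate rows of Y,
# then tell merge which column to use on each side.
want <- merge(Y, unique(X["PatientID"]),
              by.x = "Patient.ID", by.y = "PatientID")
want
```

Note that merge sorts the result by the key column by default, so the row order follows the sorted IDs rather than the order in X.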
Hi, I'd like to group by two data frame columns and apply a function to another two data frame columns.
For e.g.,
ticker <- c("A", "A", 'A', "B", "B", "B")
date <- c(1,1,2,1,2,1)
ret <- c(1,2,4,6,9,5)
vol <- c(3,5,1,6,2,3)
dat <- data.frame(ticker,date,ret,vol)
For each ticker and each date, I'd like to calculate its PIN.
Now, to avoid further confusion, perhaps it helps to just speak out the actual function. YZ is a function in the InfoTrad package, and YZ only accepts a dataframe with two columns. It uses some optimisation tool and returns an estimated PIN.
install.packages("InfoTrad")
library(InfoTrad)
get_pin_yz <- function(data) {
return(YZ(data[ ,c('volume_krw_buy', 'volume_krw_sell')])[['PIN']])
}
I know how to do this in R using a for loop. But a for loop is very computationally costly, and it might take weeks to finish running on my large dataset. Thus, I would like to ask how to do this using a group-by.
# output format is wide wrt long format as "dat"
dat_w <- data.frame(ticker = NA, date = NA, PIN = NA)
for (j in c("A", "B")){
for (k in c(1:2)){
subset <- dat %>% subset((ticker == j & date == k), select = c('ret', "vol"))
new_row <- data.frame(ticker = j, date = k, PIN = YZ(subset)$PIN)
dat_w <- rbind(dat_w, new_row)
}
}
dat_w <- dat_w[-1, ]
dat_w
Don't know if this can help you help me -- I know how to do this in python: I just write a function and run df.groupby(['ticker','date']).apply(function).
Finally, the wanted dataframe is:
ticker <- c('A','A','B','B')
date <- c(1,2,1,2)
PIN <- c(1.05e-17,2.81e-09,1.12e-08,5.39e-09)
data.frame(ticker,date,PIN)
Could somebody help out, please?
Thank you!
Best,
Darcy
Previous stuff (Feel free to ignore)
Previously, I wrote this:
My function is:
get_rv <- function(data) {
return(data[['vol']] + data[['ret']])
}
What I want is:
ticker_wanted <- c('A','A', 'B', 'B')
date_wanted <- c(1,2,1,2)
rv_wanted <- c(7,5,10,11)
df_wanted <-data.frame(ticker_wanted,date_wanted,rv_wanted)
But this is not literally what my actual function is. The vol+ret is just an example. I'm more interested in the more general case: how to groupby and apply a general function to two or more dataframes. I use the vol + ret just because I didn't want to bother others by asking them to install some potentially irrelevant package on their PC.
Update based on real-life example:
You can do a direct approach like this:
library(tidyverse)
library(InfoTrad)
dat %>%
group_by(ticker, date) %>%
summarize(PIN = YZ(as.data.frame(cur_data()))$PIN)
# A tibble: 4 x 3
# Groups: ticker [2]
ticker date PIN
<chr> <dbl> <dbl>
1 A 1 1.05e-17
2 A 2 1.56e- 1
3 B 1 1.12e- 8
4 B 2 7.07e- 9
The difficulty here was that the YZ function only accepts true data frames, not tibbles, and that it returns several values, not just PIN.
You could theoretically wrap this up into your own function and then run that function as I've shown in the example below, but maybe this way already does the trick.
I also don't expect this to run much faster than a for loop. It seems that this YZ function has a more-than-linear runtime, so passing larger amounts of data will still take some time. You can start with a small set of data, then repeat the run while increasing the size of your data by a factor of maybe 10, and check how fast it runs each time.
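A rough sketch of that scaling check, assuming a hypothetical stand-in function slow_fun in place of YZ (so it runs without InfoTrad) and made-up random data:

```r
# Hypothetical stand-in for an expensive per-group function such as YZ().
slow_fun <- function(x) {
  objective <- function(p) sum((x$ret - p[1])^2 + (x$vol - p[2])^2)
  optim(c(0, 0), objective)$par
}

set.seed(1)
sizes <- c(100, 1000, 10000)  # grow the input by a factor of 10 each time

# Time one call per input size and keep the elapsed seconds.
timings <- sapply(sizes, function(n) {
  x <- data.frame(ret = runif(n), vol = runif(n))
  system.time(slow_fun(x))[["elapsed"]]
})

data.frame(rows = sizes, seconds = timings)
```

If the seconds column grows by much more than a factor of 10 per step, the runtime is worse than linear and the full dataset will take correspondingly longer.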
In your example, you can do:
my_function <- function(data) {
data %>%
summarize(rv = sum(ret, vol))
}
library(tidyverse)
df %>%
group_by(ticker, date) %>%
my_function()
# A tibble: 4 x 3
# Groups: ticker [2]
ticker date rv
<chr> <dbl> <dbl>
1 A 1 7
2 A 2 5
3 B 1 10
4 B 2 11
But as mentioned in my comment, I'm not sure if this general example would help in your real-life use case.
It might also be that you don't need to create your own function because built-in functions already exist. As in the example, you are better off summarizing directly instead of wrapping it into a function.
You could just do this (with summarise as an example of your function):
ticker <- c("A", "A", 'A', "B", "B", "B")
date <- c(1,1,2,1,2,1)
ret <- c(1,-2,4,6,9,-5)
vol <- c(3,5,1,6,2,3)
df <- data.frame(ticker,date,ret,vol)
library(dplyr)
# define the function before calling it
get_rv <- function(data){
  result <- data %>%
    group_by(ticker, date) %>%
    summarise(rv = sum(ret) + sum(vol)) %>%
    as.data.frame()
  names(result) <- c('ticker_wanted', 'date_wanted', 'rv_wanted')
  return(result)
}
df_wanted <- get_rv(df)
Assuming that your data frame is as follows:
data <- data.frame(ticker,date,ret,vol)
Use split to split your data frame into a list of data frames based on the values of ticker and date.
dflist <- split(data, f = list(data$ticker, data$date), drop = TRUE)
Now use lapply or sapply to run the function YZ() on each data frame in dflist. Because YZ() only accepts a two-column data frame, pass just ret and vol.
pins <- lapply(dflist, function(x) YZ(x[, c("ret", "vol")])$PIN)
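The split approach returns a list, while the question asks for a data frame with one row per ticker and date. A minimal sketch of that last step, using a hypothetical stand-in function toy_stat instead of YZ so that it runs without InfoTrad:

```r
ticker <- c("A", "A", "A", "B", "B", "B")
date <- c(1, 1, 2, 1, 2, 1)
ret <- c(1, -2, 4, 6, 9, -5)
vol <- c(3, 5, 1, 6, 2, 3)
data <- data.frame(ticker, date, ret, vol, stringsAsFactors = FALSE)

# Hypothetical stand-in for YZ(): takes a two-column data frame, returns one number.
toy_stat <- function(x) sum(x$ret) + sum(x$vol)

dflist <- split(data, f = list(data$ticker, data$date), drop = TRUE)

# Build one result row per group, then stack the rows into a single data frame.
res <- do.call(rbind, lapply(dflist, function(x) {
  data.frame(ticker = x$ticker[1], date = x$date[1],
             stat = toy_stat(x[, c("ret", "vol")]))
}))
res <- res[order(res$ticker, res$date), ]
rownames(res) <- NULL
res
```

Swapping toy_stat for a wrapper around YZ(...)$PIN should give the wanted layout, though the per-group cost of YZ itself stays the same.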
I would like to create a function like this (obviously not proper code):
forEach ID in DATAFRAME1 look at each row with ID in DATAFRAME2 {
if DATAFRAME2$VARIABLE1 = something {
DATAFRAME1$VARIABLE1 = TRUE;
DATAFRAME1$VARIABLE2 = DATAFRAME2$VARIABLE2
}
}
In plain text, I've got a list of individuals and a database with mixed information on these individuals. Let's say DATAFRAME2 contains information on books read, c(id, title, author, date). I want to create a new variable in DATAFRAME1 with a boolean indicating whether the individual has read a specific book (VARIABLE1 above) and the date they first read it (VARIABLE2 above). Adding a third variable with the number of times read would also be interesting, but is not necessary.
I haven't really done this in R before, having mostly done basic statistics and basic wrangling with dplyr. I guess I could use dplyr and a join, but this feels like a better approach. Any help to get me started would be much appreciated.
The following function does what the question asks for. Its arguments are:
DF1 and DF2, the two data frames;
ID, the name of the id column (defaults to 'ID');
var1 and var2, which are VARIABLE1 and VARIABLE2 in the question;
value, which is the value of something.
The test data is at the end.
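fun <- function(DF1, DF2, ID = 'ID', var1, var2, value){
  # start with both new columns empty
  DF1[[var1]] <- NA
  DF1[[var2]] <- NA
  # rows of DF2 where var1 equals the value of interest
  k <- DF2[[var1]] == value
  for(id in DF1[[ID]]){
    i <- DF1[[ID]] == id
    j <- DF2[[ID]] == id
    # does this id have at least one matching row in DF2?
    if(any(j & k)){
      DF1[[var1]][i] <- TRUE
      DF1[[var2]][i] <- DF2[[var2]][j & k]
    }
  }
  DF1
}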
fun(df1, df2, value = 4, var1 = 'X', var2 = 'Y')
# ID X Y
#1 a NA NA
#2 d TRUE 19
Test data.
set.seed(1234)
df1 <- data.frame(ID = c("a", "d"))
df2 <- data.frame(ID = rep(letters[1:5], 4),
X = sample(20, 20, TRUE),
Y = sample(20))
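The question also asks for the date of the first read and, optionally, a read count. A loop-free sketch in base R, assuming made-up data frames people and reads and a made-up target title (none of these names come from the question):

```r
# Made-up data: one row per person, and one row per book read.
people <- data.frame(id = c(1, 2, 3))
reads <- data.frame(
  id    = c(1, 1, 2, 3, 1),
  title = c("Dune", "Emma", "Emma", "Dune", "Dune"),
  date  = as.Date(c("2020-05-01", "2019-01-10", "2021-03-03",
                    "2018-07-07", "2017-02-02")),
  stringsAsFactors = FALSE
)
target <- "Dune"

# Keep only reads of the target book, earliest first.
hits <- reads[reads$title == target, ]
hits <- hits[order(hits$date), ]

# match() returns the first matching row per person, which is the earliest read.
idx <- match(people$id, hits$id)
people$has_read   <- !is.na(idx)
people$first_read <- hits$date[idx]
people$n_reads    <- as.integer(table(factor(hits$id, levels = people$id)))
people
```

People without a matching read get FALSE, an NA date and a count of 0.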
I have the following code that is taking forever to run on my 80k-row CBP table. Could anyone help me optimize my loop? I'm simply trying to find duplicates that share the same values in certain (not all) columns, get the number of duplicates in each case, and then return the ids of the duplicates:
for (row in 1:nrow(CBP)){
subs <- subset(CBP, CBP$Lower_Bound__c == CBP[row,"Lower_Bound__c"] & CBP$Price_Book__c == CBP[row,"Price_Book__c"] & CBP$Price__c == CBP[row,"Price__c"] & CBP$Product__c == CBP[row,"Product__c"] & CBP$Department__c == CBP[row,"Department__c"] & CBP$UOM__c == CBP[row,"UOM__c"] & CBP$Upper_Bound__c == CBP[row,"Upper_Bound__c"])
if (nrow(subs)>1){
CBP[row,]$dup <- nrow(subs)
CBP[row,]$dupids <- paste(subs[,"Id"], collapse = ",")
}
print(row)
}
I'm having a hard time understanding your example. However, here's a simple approach with data.table that might work for your situation. You can create a variable (nsame in the example) that numbers the rows within each combination of several variables (var1 and var2 in the example); any row numbered above 1 is a duplicate. Then just grab the row index.
library(data.table)
# generate some example data
dt <- data.table(
var1 = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
var2 = c("a", "a", "z", "b", "y", "b", "c", "c", "c"),
var3 = 1:9
)
# counter for each combination of var1-var2
dt[ , nsame := 1:.N, by=.(var1, var2)]
# duplicates are where the counter is > 1
which(dt$nsame > 1)
## 2 6 8 9
Using base R:
dupe_columns = c(
"Lower_Bound__c", "Price_Book__c", "Price__c", "Product__c",
"Department__c", "UOM__c", "Upper_Bound__c"
)
# which rows are duplicated
dupes = which(duplicated(CBP[, dupe_columns]) | duplicated(CBP[, dupe_columns], fromLast = TRUE))
# how many are there
length(dupes)
# IDs that are duplicated
CBP[dupes, "Id"]
# collapse Ids with duplicates by group:
aggregate(CBP$Id, by = CBP[dupe_columns], FUN = paste, collapse = ",")
If any of this doesn't work or you need more help, post 10-20 rows of sample data (use dput() so it is copy/pasteable!!!) so we can test and verify.
Subtle point, but I use CBP[, dupe_columns] in the duplicated() line because duplicated() will work the same whether we give it a data frame or a vector. CBP[, dupe_columns] will be a data frame if you have more than one column to check for dupes, but will be a vector if you give it a single column. However, when we get down to aggregate we need the by argument to be a list (like a data frame). So I use CBP[dupe_columns] (no comma) which will guarantee a data frame even if we are only checking a single column.
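To get per-row dup and dupids columns like the ones the loop in the question fills in, one option is ave(). This is a sketch on a made-up miniature CBP table with only two of the comparison columns:

```r
# Made-up miniature version of the CBP table.
CBP <- data.frame(
  Id         = c("id1", "id2", "id3", "id4", "id5"),
  Price__c   = c(10, 10, 20, 10, 30),
  Product__c = c("p1", "p1", "p2", "p1", "p3"),
  stringsAsFactors = FALSE
)
dupe_columns <- c("Price__c", "Product__c")

# One group label per distinct combination of the comparison columns.
grp <- interaction(CBP[dupe_columns], drop = TRUE)

# Group size, repeated on every row of the group.
CBP$dup <- ave(seq_along(grp), grp, FUN = length)

# All Ids in the group, collapsed into one string per row.
CBP$dupids <- ave(CBP$Id, grp, FUN = function(ids) paste(ids, collapse = ","))
CBP
```

Unlike the loop, this also fills in rows that have no duplicate (dup is 1 and dupids is the row's own Id); those can be set to NA afterwards if needed.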
This is a restatement of my poorly worded previous question. (To those who replied to it, I appreciate your efforts, and I apologize for not being as clear with my question as I should have been.) I have a large dataset, a subset of which might look like this:
a<-c(1,2,3,4,5,1)
b<-c("a","b","a","b","c","a")
c<-c("m","f","f","m","m","f")
d<-1:6
e<-data.frame(a,b,c,d)
If I want the sum of the entries in the fourth column based on a specific condition, I could do something like this:
attach(e)
total<-sum(e[which(a==3 & b=="a"),4])
detach(e)
However, I have a "vector" of conditions (call it condition_vector), the first four elements of which look more like this:
a==3 & b == "a"
a==2
a==1 & b=="a" & c=="m"
c=="f"
I'd like to create a "generalized" version of the "total" formula above that produces a results_vector of totals by reading in the condition_vector of conditions. In this example, the first four entries in the results_vector would be calculated conceptually as follows:
results_vector[1]<-sum(e[which(a==3 & b=="a"),4])
results_vector[2]<-sum(e[which(a==2),4])
results_vector[3]<-sum(e[which(a==1 & b=="a" & c=="m"),4])
results_vector[4]<-sum(e[which(c=="f"),4])
My actual data set has more than 20 variables. So each record in the condition_vector can contain anywhere from 1 to more than 20 conditions (as opposed to between 1 and 3 conditions, used in this example).
Is there a way to accomplish this other than using an eval(parse(text = ...)) approach (which takes a long time to run on a relatively small dataset)?
Thanks in advance for any help you can provide (and again, I apologize that I wasn't as clear as I should have been last time around).
Spark
Here is a solution using eval(parse(text = ...)), even if you find it slow:
cond <- c('a==3 & b == "a"','a==2','a==1 & b=="a" & c=="m"','c=="f"')
names(cond) <- cond
results_vector <- lapply(cond, function(x)
  sum(e[eval(parse(text = x), e), "d"]))
$`a==3 & b == "a"`
[1] 3
$`a==2`
[1] 2
$`a==1 & b=="a" & c=="m"`
[1] 1
$`c=="f"`
[1] 11
The advantage of naming your conditions vector is that you can access your results by condition.
results_vector[cond[2]]
$`a==2`
[1] 2
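If the conditions do not have to arrive as text, one alternative is to store them as unevaluated expressions with quote() and evaluate each one against the data frame. This is a sketch under that assumption, and it is not benchmarked here:

```r
e <- data.frame(
  a = c(1, 2, 3, 4, 5, 1),
  b = c("a", "b", "a", "b", "c", "a"),
  c = c("m", "f", "f", "m", "m", "f"),
  d = 1:6,
  stringsAsFactors = FALSE
)

# Conditions kept as language objects, so no text parsing is needed later.
conds <- list(
  quote(a == 3 & b == "a"),
  quote(a == 2),
  quote(a == 1 & b == "a" & c == "m"),
  quote(c == "f")
)

# Evaluate each condition with the columns of e in scope, then sum column d.
results_vector <- vapply(conds, function(cond) sum(e$d[eval(cond, e)]),
                         numeric(1))
results_vector
```

If the conditions must stay as strings, parsing them all once up front with parse(text = ...) and reusing the parsed expressions avoids re-parsing the same text repeatedly.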
Here is a function that takes as arguments the condition in each column (if no condition in a column, then NA as argument) and sums in a selected column of a selected data.frame:
conds.by.col <- function(..., sumcol, DF) #NA if not condition in a column
{
conds.ls <- list(...)
res.ls <- vector("list", length(conds.ls))
for(i in 1: length(conds.ls))
{
res.ls[[i]] <- which(DF[,i] == conds.ls[[i]])
}
res.ls <- res.ls[which(lapply(res.ls, length) != 0)]
which_rows <- Reduce(intersect, res.ls)
return(sum(DF[which_rows , sumcol]))
}
Test:
a <- c(1,2,3,4,5,1)
b <- c("a", "b", "a", "b", "c", "a")
c <- c("m", "f", "f", "m", "m", "f")
d <- 1:6
e <- data.frame(a, b, c, d)
conds.by.col(3, "a", "f", sumcol = 4, DF = e)
#[1] 3
For multiple conditions, use mapply:
# all conditions in a data.frame, one set of conditions per column:
myconds <- data.frame(con1 = c(3, "a", "f"),
                      con2 = c(NA, "a", NA),
                      con3 = c(1, NA, "f"),
                      stringsAsFactors = FALSE)
mapply(conds.by.col, myconds[1,], myconds[2,], myconds[3,], MoreArgs = list(sumcol = 4, DF = e))
#con1 con2 con3
# 3 10 6
I guess "efficiency" isn't the first word that comes to mind when looking at this, though...