I am attempting to replace an inefficient nested for loop that will not run on a large dataset with the apply function.
unique <- cbind.data.frame(c(1,2,3))
colnames(unique) <- "note"
ptSeenSub <- rbind.data.frame(c(1,"a"), c(1,"b"), c(2,"a"), c(2,"d"), c(3,"e"), c(3,"f"))
colnames(ptSeenSub) <- c("PARENT_EVENT_ID", "USER_NAME")
uniqueRow <- nrow(unique)
ptSeenSubRow <- nrow(ptSeenSub)
for (note in 1:uniqueRow)
{
for (row in 1:ptSeenSubRow)
{
if (ptSeenSub$PARENT_EVENT_ID[row] == unique$note[note])
{
unique$attending_name[note] <- ptSeenSub$USER_NAME[row]
unique$resident_name[note] <- ptSeenSub$USER_NAME[row + 1]
break # stop at the first match so the pair is not overwritten by the next row
}
}
}
I would like the results to be similar to this dataframe:
results <- rbind.data.frame(c(1, "a", "b"), c(2, "a", "d"), c(3,"e", "f"))
colnames(results) <- c("note", "attending_name", "resident_name")
The loop will be running over millions of rows and will not finish. How can I vectorize this so that it finishes on large data sets? Any advice is greatly appreciated.
Sounds like you are trying to reshape data into wide format. I find that dplyr and tidyr provide nice tools to accomplish this.
define data
library(tidyr)
library(dplyr)
ptSeenSub <- rbind.data.frame(c(1,"a"), c(1,"b"), c(2,"a"), c(2,"d"), c(3,"e"), c(3,"f"))
colnames(ptSeenSub) <- c("PARENT_EVENT_ID", "USER_NAME")
reshape
result <- ptSeenSub %>%
group_by(PARENT_EVENT_ID) %>%
mutate(k = row_number()) %>%
spread(k, USER_NAME)
You can then change names if you wish:
names(result) <- c("note", "attending_name", "resident_name")
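In current tidyr releases, spread() is superseded by pivot_wider(). A minimal sketch of the same reshape with pivot_wider() follows. It assumes that each PARENT_EVENT_ID has at most two rows and that the first row is the attending and the second the resident; the helper column name role is only an illustration.

```r
library(dplyr)
library(tidyr)

# Same toy data as above, built directly with named columns.
ptSeenSub <- data.frame(
  PARENT_EVENT_ID = c(1, 1, 2, 2, 3, 3),
  USER_NAME = c("a", "b", "a", "d", "e", "f")
)

result <- ptSeenSub %>%
  group_by(PARENT_EVENT_ID) %>%
  # Assumption: row 1 of each group is the attending, row 2 the resident.
  mutate(role = c("attending_name", "resident_name")[row_number()]) %>%
  ungroup() %>%
  pivot_wider(names_from = role, values_from = USER_NAME) %>%
  rename(note = PARENT_EVENT_ID)

result
```

Naming the roles before widening avoids the separate renaming step, at the cost of hard-coding the two-rows-per-note assumption.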
You could also use dcast from either reshape2 or the devel version of data.table (should be fast) i.e. v1.9.5
library(data.table)
setnames(dcast(setDT(ptSeenSub)[, N:= 1:.N, PARENT_EVENT_ID],
PARENT_EVENT_ID~N, value.var='USER_NAME'),
c('note', 'attending_name', 'resident_name'))[]
# note attending_name resident_name
#1: 1 a b
#2: 2 a d
#3: 3 e f
If there are only two observations for each 'PARENT_EVENT_ID':
setDT(ptSeenSub)[,.(attending_name=USER_NAME[1L],
resident_name=USER_NAME[2L]) , .(note=PARENT_EVENT_ID)]
# note attending_name resident_name
#1: 1 a b
#2: 2 a d
#3: 3 e f
Related
Hi, I'd like to group by two dataframe columns and apply a function to another two dataframe columns.
For example,
ticker <- c("A", "A", 'A', "B", "B", "B")
date <- c(1,1,2,1,2,1)
ret <- c(1,2,4,6,9,5)
vol <- c(3,5,1,6,2,3)
dat <- data.frame(ticker,date,ret,vol)
For each ticker and each date, I'd like to calculate its PIN.
Now, to avoid further confusion, perhaps it helps to just speak out the actual function. YZ is a function in the InfoTrad package, and YZ only accepts a dataframe with two columns. It uses some optimisation tool and returns an estimated PIN.
install.packages("InfoTrad")
library(InfoTrad)
get_pin_yz <- function(data) {
return(YZ(data[ ,c('volume_krw_buy', 'volume_krw_sell')])[['PIN']])
}
I know how to do this in R using a for loop. But a for loop is computationally very costly, and it might take weeks to finish running on my large dataset. Thus, I would like to ask how to do this using a group-by.
# output format is wide wrt long format as "dat"
dat_w <- data.frame(ticker = NA, date = NA, PIN = NA)
for (j in c("A", "B")){
for (k in c(1:2)){
subset <- dat %>% subset((ticker == j & date == k), select = c('ret', "vol"))
new_row <- data.frame(ticker = j, date = k, PIN = YZ(subset)$PIN)
dat_w <- rbind(dat_w, new_row)
}
}
dat_w <- dat_w[-1, ]
dat_w
Don't know if this can help you help me -- I know how to do this in python: I just write a function and run df.groupby(['ticker','date']).apply(function).
Finally, the wanted dataframe is:
ticker <- c('A','A','B','B')
date <- c(1,2,1,2)
PIN <- c(1.05e-17,2.81e-09,1.12e-08,5.39e-09)
data.frame(ticker,date,PIN)
Could somebody help out, please?
Thank you!
Best,
Darcy
Previous stuff (Feel free to ignore)
Previously, I wrote this:
My function is:
get_rv <- function(data) {
return(data[['vol']] + data[['ret']])
}
What I want is:
ticker_wanted <- c('A','A', 'B', 'B')
date_wanted <- c(1,2,1,2)
rv_wanted <- c(7,5,10,11)
df_wanted <-data.frame(ticker_wanted,date_wanted,rv_wanted)
But this is not literally what my actual function is. The vol+ret is just an example. I'm more interested in the more general case: how to groupby and apply a general function to two or more dataframes. I use the vol + ret just because I didn't want to bother others by asking them to install some potentially irrelevant package on their PC.
Update based on real-life example:
You can do a direct approach like this:
library(tidyverse)
library(InfoTrad)
dat %>%
group_by(ticker, date) %>%
summarize(PIN = YZ(as.data.frame(cur_data()))$PIN)
# A tibble: 4 x 3
# Groups: ticker [2]
ticker date PIN
<chr> <dbl> <dbl>
1 A 1 1.05e-17
2 A 2 1.56e- 1
3 B 1 1.12e- 8
4 B 2 7.07e- 9
The difficulty here was that the YZ function only accepts true data frames, not tibbles, and that it returns several values, not just PIN.
You could theoretically wrap this up into your own function and then run it as I've shown in the example below, but maybe this approach already does the trick.
I also don't expect this to run much faster than a for loop. The YZ function seems to have a more-than-linear runtime, so passing larger amounts of data will still take some time. You can start with a small set of data, then repeatedly increase its size by a factor of maybe 10 and check how fast it runs.
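The scaling check described above could be sketched as follows. Here slow_fn is a hypothetical, deliberately quadratic stand-in for an expensive call such as YZ(); it is not the InfoTrad function, and the sizes are arbitrary.

```r
# Hypothetical stand-in for an expensive computation (NOT InfoTrad::YZ).
slow_fn <- function(x) {
  sum(outer(x, x))  # builds an n x n matrix, so the cost grows roughly with n^2
}

# Grow the input by a factor of 10 at each step.
sizes <- c(30, 300, 3000)

timings <- sapply(sizes, function(n) {
  x <- runif(n)
  system.time(slow_fn(x))[["elapsed"]]
})

# Compare how the elapsed time grows relative to the input size.
data.frame(n = sizes, elapsed_sec = timings)
```

If the elapsed time grows much faster than the input size, switching from a for loop to a grouped call will not remove the bottleneck.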
In your example, you can do:
my_function <- function(data) {
data %>%
summarize(rv = sum(ret, vol))
}
library(tidyverse)
df %>%
group_by(ticker, date) %>%
my_function()
# A tibble: 4 x 3
# Groups: ticker [2]
ticker date rv
<chr> <dbl> <dbl>
1 A 1 7
2 A 2 5
3 B 1 10
4 B 2 11
But as mentioned in my comment, I'm not sure if this general example would help in your real-life use case.
It might also be that you don't need to create your own function because built-in functions already exist. As in the example, you are better off summarizing directly instead of wrapping it in a function.
You could just do this (with summarise as an example of your function):
ticker <- c("A", "A", 'A', "B", "B", "B")
date <- c(1,1,2,1,2,1)
ret <- c(1,-2,4,6,9,-5)
vol <- c(3,5,1,6,2,3)
df <- data.frame(ticker,date,ret,vol)
library(dplyr)
# define the function first, then call it
get_rv <- function(data){
result <- data %>%
group_by(ticker,date) %>%
summarise(rv =sum(ret) + sum(vol)) %>%
as.data.frame()
names(result) <- c('ticker_wanted', 'date_wanted', 'rv_wanted')
return(result)
}
df_wanted <- get_rv(df)
Assuming that your dataframe is as follows:
data <- data.frame(ticker,date,ret,vol)
Use split to split your dataframe into a list of dataframes based on the values of ticker and date.
dflist = split(data, f = list(data$ticker, data$date), drop = TRUE)
Now use lapply or sapply to run the function YZ() on each dataframe member of dflist.
# YZ() only accepts the two data columns, so drop ticker and date first
pins <- lapply(dflist, function(x) YZ(x[, c("ret", "vol")])$PIN)
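To get back to one row per ticker and date, the per-group results can be bound into a data frame. The sketch below uses group_stat, a hypothetical stand-in for YZ(x)$PIN, so that it runs without InfoTrad; the column name value is also just an illustration.

```r
ticker <- c("A", "A", "A", "B", "B", "B")
date <- c(1, 1, 2, 1, 2, 1)
ret <- c(1, 2, 4, 6, 9, 5)
vol <- c(3, 5, 1, 6, 2, 3)
data <- data.frame(ticker, date, ret, vol)

# Hypothetical per-group statistic standing in for YZ(x[, c("ret", "vol")])$PIN.
group_stat <- function(x) sum(x$ret) + sum(x$vol)

dflist <- split(data, f = list(data$ticker, data$date), drop = TRUE)

# Build a one-row data frame per group, keeping the group keys.
rows <- lapply(dflist, function(x) {
  data.frame(ticker = x$ticker[1], date = x$date[1], value = group_stat(x))
})

# Stack the rows and sort by the group keys.
out <- do.call(rbind, rows)
out <- out[order(out$ticker, out$date), ]
rownames(out) <- NULL
out
```

Carrying the keys inside each one-row data frame avoids having to parse them back out of the list names that split() generates.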
I have a vector containing "potential" column names:
col_vector <- c("A", "B", "C")
I also have a data frame, e.g.
library(tidyverse)
df <- tibble(A = 1:2,
B = 1:2)
My goal now is to create all columns mentioned in col_vector that don't yet exist in df.
For the above example, my code below works:
df %>%
mutate(!!sym(setdiff(col_vector, colnames(.))) := NA)
# A tibble: 2 x 3
A B C
<int> <int> <lgl>
1 1 1 NA
2 2 2 NA
The problem is that this code fails as soon as a) more than one column from col_vector is missing or b) no column from col_vector is missing. I thought about some sort of if_else, but don't know how to make the column creation conditional in such a way, preferably in a tidyverse way. I know I can just create a loop going through all the missing columns, but I'm wondering if there is a more direct approach.
Example data where code above fails:
df2 <- tibble(A = 1:2)
df3 <- tibble(A = 1:2,
B = 1:2,
C = 1:2)
This should work.
df[,setdiff(col_vector, colnames(df))] <- NA
Solution
This base operation might be simpler than a full-fledged dplyr workflow:
library(tidyverse) # For tibble(); setdiff() itself is available in base R.
# ...
# Code to generate 'df'.
# ...
# Find the subset of missing names, and create them as columns filled with 'NA'.
df[, setdiff(col_vector, names(df))] <- NA
# View results
df
Results
Given your sample col_vector and df here
col_vector <- c("A", "B", "C")
df <- tibble(A = 1:2, B = 1:2)
this solution should yield the following results:
# A tibble: 2 x 3
A B C
<int> <int> <lgl>
1 1 1 NA
2 2 2 NA
Advantages
An advantage of my solution, over the alternative linked above by @geoff, is that you need not code by hand the set of column names as symbols and strings within the dplyr workflow.
df %>% mutate(
#####################################
A = ifelse("A" %in% names(.), A, NA),
B = ifelse("B" %in% names(.), B, NA),
C = ifelse("C" %in% names(.), C, NA)
# ...
# etc.
#####################################
)
My solution is by contrast more dynamic
##############################
df[, setdiff(col_vector, names(df))] <- NA
##############################
if you ever decide to change (or even dynamically calculate!) your variable names midstream, since it determines the setdiff() at runtime.
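As a minimal sketch, the same idea can be wrapped in a small helper and checked against the two failing cases from the question. The name add_missing_cols is hypothetical, plain data frames are used instead of tibbles to keep the example in base R, and the explicit length check is a defensive assumption rather than something the assignment is known to require.

```r
# Hypothetical helper: add every column named in `cols` that `df` lacks, filled with NA.
add_missing_cols <- function(df, cols) {
  missing <- setdiff(cols, names(df))
  if (length(missing) > 0) {
    df[missing] <- NA
  }
  df
}

col_vector <- c("A", "B", "C")
df2 <- data.frame(A = 1:2)                    # two columns missing
df3 <- data.frame(A = 1:2, B = 1:2, C = 1:2)  # no column missing

names(add_missing_cols(df2, col_vector))
names(add_missing_cols(df3, col_vector))
```

Because the set of missing names is computed at call time, the helper handles zero, one, or several missing columns without any change.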
Note
Incredibly, @AustinGraves posted their answer at precisely the same time (2021-10-25 21:03:05Z) as I posted mine, so both answers qualify as original solutions.
I want to extract a column from a dataframe in R based on a condition for another column in the same dataframe, the dataframe is given below.
b <- c(1,2,3,4)
g <- c("a", "b" ,"b", "c")
df <- data.frame(b,g)
row.names(df) <- c("aa", "bb", "cc" , "dd")
I want to extract all values for column b as a dataframe (with rownames) where column g has value 'b',
My required output is given below:
df
b
bb 2
cc 3
I have tried several methods like which or subset but it does not work. I have also tried to find the answer to this question on stackoverflow but I was not able to find it. Is there a way to do it?
Thanks,
You can use the subset function in base R -
subset(df, g == 'b', select = b)
# b
#bb 2
#cc 3
Using data.table
library(data.table)
setDT(df, key = 'g')['b', .(b)]
b
1: 2
2: 3
Or with collapse
library(collapse)
sbt(df, g == 'b', b)
b
1 2
2 3
This is the basic way of slicing data in R:
df[df$g == 'b',]['b']
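A single-step variant of the base R slice, as a sketch: passing drop = FALSE keeps the result as a data frame, with its row names, even though only one column is selected.

```r
b <- c(1, 2, 3, 4)
g <- c("a", "b", "b", "c")
df <- data.frame(b, g)
row.names(df) <- c("aa", "bb", "cc", "dd")

# Select rows where g is "b" and only column b; drop = FALSE prevents
# the one-column result from collapsing to a plain vector.
out <- df[df$g == "b", "b", drop = FALSE]
out
```

Without drop = FALSE, the same expression returns a bare numeric vector and the row names are lost.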
Or the tidyverse answer
df %>%
filter(g == 'b') %>%
select(b)
I have a vector of values and a data frame.
I would like to filter out the rows of the data frame which contain (in specific column) any of the values in my vector.
I'm trying to figure out if a person in the survey has a child who was also questioned in the survey - if so I would like to remove them from my data frame.
I have a list of respondent IDs, and vectors of mother/father personal IDs. If the ID appears in the mother/father column I would like to remove it.
df <- data.frame(ID= c(101,102,103,104,105), Name = (Martin, Sammie, Reg, Seamus, Aine)
vec <- c(103,105,108,120,150)
Output should be a dataframe with three rows - Martin, Sammie, Seamus.
ID Name
1 101 Martin
2 102 Sammie
3 104 Seamus
df[!(df$ID %in% vec), ] # Or subset(df, !(ID %in% vec))
# ID Name
# 1 101 Martin
# 2 102 Sammie
# 4 104 Seamus
Data
df <- data.frame(ID= c(101,102,103,104,105), Name = c("Martin", "Sammie", "Reg", "Seamus", "Aine"))
vec <- c(103,105,108,120,150)
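The filtered result above keeps the original row names (1, 2, 4), while the output requested in the question is numbered 1 to 3. If that matters, a small sketch of resetting them:

```r
df <- data.frame(ID = c(101, 102, 103, 104, 105),
                 Name = c("Martin", "Sammie", "Reg", "Seamus", "Aine"))
vec <- c(103, 105, 108, 120, 150)

# Keep only rows whose ID does not appear in vec.
out <- df[!(df$ID %in% vec), ]

# Reset the row names so they run 1..n again.
rownames(out) <- NULL
out
```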
You can do this with filter from dplyr
library(tidyverse)
df2 <- df%>%
filter(!ID %in% vec)
If you create this as a data.table (and load data.table package, and fix the errors in the example data):
library(data.table)
df <- data.table(ID= c(101,102,103,104,105), Name = c("Martin", "Sammie", "Reg", "Seamus", "Aine"))
vec <- c(103,105,108,120,150)
# solution, slightly different from base R
df[!(ID %in% vec)]
data.table is likely to run somewhat faster than base R, which makes it useful for large datasets. In a microbenchmark on a 10-million-row dataset comparing base R, the tidyverse and data.table, data.table was a bit quicker than the tidyverse and a lot faster than base R.
library(tidyverse)
library(data.table)
library(microbenchmark)
n <- 10000000
df <- data.frame("ID" = c(1:n), "Name" = sample(LETTERS, size = n, replace = TRUE))
dt <- data.table(df)
vec <- sample(1:n, size = n/10, replace = FALSE)
microbenchmark(dt[!(ID %in% vec)], df[!(df$ID %in% vec),], df%>% filter(!ID %in% vec))
I have a dataset like this
id <- 1:12
b <- c(0,0,1,2,0,1,1,2,2,0,2,2)
c <- rep(NA,3)
d <- rep(NA,3)
df <-data.frame(id,b)
newdf <- data.frame(c,d)
I want to do some simple math: count how many values of b equal 1 and how many equal 2. But I don't want to count over the whole dataset; I want my function to count them in blocks of four rows.
I want a result like this:
> newdf
one two
1 1 1
2 2 1
3 0 3
I tried this with lots of variations but couldn't succeed.
afonk <- function(x) {
ifelse(x==1 | x==2, x, newdf <- (x[1]+x[2]))
}
afonk(newdf$one)
lapply(newdf, afonk)
Thanks in advance!
ismail
Fun with base R:
# counting function
countnum <- function(x,num){
sum(x == num)
}
# make list of groups of 4
df$group <- rep(1:ceiling(nrow(df)/4),each = 4)[1:nrow(df)]
dfl <- split(df$b,f = df$group)
# make data frame of counts
newdf <- data.frame(one = sapply(dfl,countnum,1),
two = sapply(dfl,countnum,2))
Edit based on comment:
# make list of groups of 4
df$group <- rep(1:ceiling(nrow(df)/4),each = 4)[1:nrow(df)]
table(subset(df, b != 0L)[c("group", "b")])
Which you prefer depends on what type of result you need. A table will work for a small visual count, and you can likely pull the data out of the table, but if it is as simple as your example, you might opt for the data.frame.
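If the data.frame form is wanted directly from the table, one possible sketch is to tabulate against a factor with fixed levels, so that a count of zero still gets a cell. The grouping uses integer division on id, which assumes id runs 1, 2, 3, ... without gaps.

```r
id <- 1:12
b <- c(0, 0, 1, 2, 0, 1, 1, 2, 2, 0, 2, 2)
df <- data.frame(id, b)

# Assumption: id is consecutive, so (id - 1) %/% 4 + 1 labels blocks of four rows.
grp <- (df$id - 1) %/% 4 + 1

# Fixing the levels to 1 and 2 drops the zeros and keeps empty cells as 0.
counts <- table(grp, factor(df$b, levels = c(1, 2)))

newdf <- data.frame(one = as.vector(counts[, "1"]),
                    two = as.vector(counts[, "2"]))
newdf
```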
We could use dcast from data.table. Create a grouping variable using %/% and then dcast from 'long' to 'wide' format.
library(data.table)
dcast(setDT(df)[,.N ,.(grp=(id-1)%/%4+1L, b)],
grp~b, value.var='N', fill =0)[,c(3,4), with=FALSE] # columns 3 and 4 hold the counts of 1 and 2
Or a slightly more compact version would be using fun.aggregate as length.
res <- dcast(setDT(df)[,list((id-1)%/%4+1L, b)][b!=0],
V1~b, length)[,V1:=NULL][]
res
# 1 2
#1: 1 1
#2: 2 1
#3: 0 3
If we need the column names to be 'one', 'two'
library(english)
names(res) <- as.character(english(as.numeric(names(res))))