R data.table user defined function - r

I am transitioning from data.frame to data.table in R for better performance. A major part of converting my code is replacing custom functions applied row-wise with apply on a data.frame with their data.table equivalents.
Say I have a simple data table, dt1.
x y z
1 9 j
4 1 n
7 1 n
I am trying to calculate a new column in dt1 based on the values of x, y and z.
I tried two ways. Both give the correct result, but the faster one produces a warning, so I want to make sure the warning is nothing serious before I use the faster version when converting my existing code.
(1) dt1[,a:={if((x<1) & (y>3) & (z == "n")){6} else {7}}]
(2) dt1[,a:={if((x<1) & (y>3) & (z == "n")){6} else {7}}, by = 1:nrow(dt1)]
Version 1 runs faster than version 2, but gives the warning "the condition has length > 1 and only the first element will be used".
But the result is good.
The second version is slightly slower but doesn't give that warning.
I wanted to make sure version one doesn't give erratic results once I start writing complicated functions.
Please treat the question as a generic one with the view to run a user defined function which wants to access different column values in a given row and calculate the new column value for that row.
Thanks for your help.

If 'x', 'y', and 'z' are the columns of 'dt1', try either the vectorized ifelse
dt1[, a:=ifelse(x<1 & y >3 & z=='n', 6, 7)]
Or create 'a' with 7, then assign 6 to 'a' based on the logical index.
dt1[, a := 7][x<1 & y >3 & z=='n', a:=6][]
Using a function
getnewvariable <- function(v1, v2, v3){
ifelse(v1 <1 & v2 >3 & v3=='n', 6, 7)
}
dt1[, a:=getnewvariable(x,y,z)][]
data
df1 <- structure(list(x = c(0L, 1L, 4L, 7L, -2L), y = c(4L, 9L, 1L,
1L, 5L), z = c("n", "j", "n", "n", "n")), .Names = c("x", "y",
"z"), class = "data.frame", row.names = c(NA, -5L))
dt1 <- as.data.table(df1)
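To see why the two versions in the question differ, here is a minimal sketch using the sample data above. `if` expects a single TRUE/FALSE, so version 1 only ever looks at the first row (and R 4.2.0 and later raise an error instead of a warning), while `ifelse` evaluates the condition for every row at once. The column name `b` is used only for this illustration.

```r
library(data.table)

dt1 <- data.table(x = c(0L, 1L, 4L, 7L, -2L),
                  y = c(4L, 9L, 1L, 1L, 5L),
                  z = c("n", "j", "n", "n", "n"))

# Vectorized: the condition is a logical vector with one element per row.
dt1[, a := ifelse(x < 1 & y > 3 & z == "n", 6, 7)]

# Row by row: each group holds exactly one row, so `if` sees a length-1 condition.
dt1[, b := if (x < 1 & y > 3 & z == "n") 6 else 7, by = seq_len(nrow(dt1))]

print(dt1)
```

Both columns hold the same values; the vectorized form avoids the per-row grouping overhead.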

Related

How can I apply case_when(mapply (adist, x, y) <= 3 ~ x, TRUE ~ y)) to columns of different length and order

Hi, I have been trying for a while to match two large columns of names, several of which have different spellings etc. So far I have written some code to practice on a smaller dataset:
examples%>% mutate(new_ID = case_when(mapply (adist, example_1 , example_2) <= 3 ~ example_1, TRUE ~ example_2))
This creates a new column containing the name from example 1 if it is within an edit distance of 3. However, it does not give the name from example 2 when this criterion is not met, which I need it to do.
This code also only compares the values in the same row of each column, whereas I need it to work on a dataset with two columns of different length (one is larger, so they can't be put in the same order).
It also needs to avoid trying to match the NAs in the smaller column of names (they are there to pad it to the same length as the other one).
Anyone know how to do something like this?
dput(head(examples))
structure(list(. = structure(c(4L, 3L, 2L, 1L, 5L), .Label = c("grarryfieldsred","harroldfrankknight", "sandramaymeres", "sheilaovensnew", "terrifrank"), class = "factor"), example_2 = structure(c(4L, 2L, 3L, 1L,
5L), .Label = c(" grarryfieldsred", "candramymars", "haroldfranrinight",
"sheilowansknew", "terryfrenk"), class = "factor")), row.names = c(NA,
5L), class = "data.frame")
The problem is that your columns have become factors rather than character vectors. When you try to combine two columns together with different factor levels, unexpected results can happen.
First convert your columns to character:
library(dplyr)
examples %>%
mutate(across(contains("example"),as.character)) %>%
mutate(new_ID = case_when(mapply (adist, example_1 , example_2) <= 3 ~ example_1,
TRUE ~ example_2))
# example_1 example_2 new_ID
#1 sheilaovensnew sheilowansknew sheilowansknew
#2 sandramaymeres candramymars candramymars
#3 harroldfrankknight haroldfranrinight harroldfrankknight
#4 grarryfieldsred grarryfieldsred grarryfieldsred
#5 terrifrank terryfrenk terrifrank
In your dput output, somehow the name of example_1 was changed. I ran this first:
names(examples)[1] <- "example_1"
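The code above compares names row by row. For columns of different length and order, one possible approach (a sketch, not part of the answer above) is to compute the full edit-distance matrix with `adist` and pick the closest candidate for each name. The vectors `a` and `b` and the threshold of 3 are illustrative assumptions.

```r
# Hypothetical name vectors of different length and in different order.
a <- c("sheilaovensnew", "sandramaymeres", "terrifrank")
b <- c("terryfrenk", "sheilowansknew", "candramymars", "haroldfranrinight")

# adist() returns a matrix: one row per element of a, one column per element of b.
d <- adist(a, b)

# For each name in a, find the closest name in b and its distance.
best      <- apply(d, 1, which.min)
best_dist <- d[cbind(seq_along(a), best)]

# Keep the match only when it is within the threshold; otherwise NA.
result <- data.frame(name     = a,
                     closest  = b[best],
                     distance = best_dist,
                     match    = ifelse(best_dist <= 3, b[best], NA_character_),
                     stringsAsFactors = FALSE)
print(result)
```

Because the shorter vector is never padded, there are no NAs to skip. For very large columns the full distance matrix may become expensive, so this is only a starting point.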

remove rows if value matches that which was conditionally removed in r

I have a data frame. I'm trying to remove rows that have values in a column that match other rows that were conditionally removed. Here is a simple example to explain.
I tried using a previous post as a starting point:
Remove Rows From Data Frame where a Row match a String
>dat
A,B,C
4,3,Foo
2,3,Bar
1,2,Bar
7,5,Zap
First remove rows with "Foo" in column C:
dat[!grepl("Foo", dat$C),]
Now I want to remove any additional rows that have values in column B that match the values in rows with Foo. So in this example, any rows with B = 3 would be removed because row 1 has Foo, which was removed and has B=3.
>dat.new
1,2,Bar
7,5,Zap
Any ideas on how to do this would be appreciated.
We subset the 'B' values where 'C' is 'Foo', check which 'B' values are in that set, negate (!) the result, and also require that 'C' is not "Foo"
library(dplyr)
dat.new <- dat %>%
filter(!B %in% B[C == 'Foo'], C != 'Foo')
dat.new
# A B C
#1 1 2 Bar
#2 7 5 Zap
Or in base R with subset
subset(dat, !B %in% B[C == 'Foo'] & C != "Foo")
data
dat <- structure(list(A = c(4L, 2L, 1L, 7L), B = c(3L, 3L, 2L, 5L),
C = c("Foo", "Bar", "Bar", "Zap")), row.names = c(NA, -4L
), class = "data.frame")
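For readers who want to inspect the intermediate steps, the same logic can be sketched in base R as follows. The names `flagged_B` and `keep` are introduced here only for illustration.

```r
dat <- data.frame(A = c(4L, 2L, 1L, 7L),
                  B = c(3L, 3L, 2L, 5L),
                  C = c("Foo", "Bar", "Bar", "Zap"),
                  stringsAsFactors = FALSE)

# B values that occur in rows where C is "Foo".
flagged_B <- dat$B[dat$C == "Foo"]

# Keep rows that are not "Foo" and whose B is not one of the flagged values.
keep <- dat$C != "Foo" & !(dat$B %in% flagged_B)

dat.new <- dat[keep, ]
print(dat.new)
```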

R: passing column names as variables in custom function

I am quite new to R and programming in general and have been struggling with the following for a few hours now.
I am trying to create a function that will take a df and a column name as variables, filter the table based on the column name provided and print the output.
example_function <- function(df=df, col=col){
a <- df[col == 100,]
b <- filter(df, col == 100)
print(a)
print(b)
}
Using example_function(df=example_df, col='percentage') doesn't work, both variables return just the column names but no data rows (despite there being values == 100).
Using example_function(df=df, col=percentage), so percentage isn't surrounded by quotes here, I get:
Error in [.data.frame(df, col == 100, ) : object 'percentage' not found
However, when I run example_function(df=example_df, col=example_df$percentage) I get the correct result, with my dataframe returning as expected with only those rows where the example_df$percentage is equal to 100.
I really want to be able to pass the df as one variable and the column as another without having to type example_df$percentage each time as I want to be able to re-use the function for many different dataframes and typing that seems redundant.
Based on this I then modified the function thinking that I can just use df$col in the function and it will evaluate to example_df$percentage and work like it did above:
example_function <- function(df=df, col=col){
a <- df[df$col == 100,]
b <- filter(df, df$col == 100)
print(a)
print(b)
}
But now I get another error when using example_function(df=example_df, col=percentage) or when passing col='percentage':
Error in filter_impl(.data, quo) : Result must have length 19, not 0
Would anybody be able to help me fix this, or point me in the right direction to understand why what I'm doing isn't working?
Thanks so much
Here is an example of the dataframe I am using (although my real one will have more columns but I hope it won't make a difference for this example.)
name | percentage
-----------------------
tom | 80
john | 100
harry | 99
elizabeth| 100
james | 50
example_df <- structure(list(name = structure(c(5L, 4L, 2L, 1L, 3L), .Label = c("elizabeth",
"harry", "james", "john", "tom"), class = "factor"), percentage = c(80L,
100L, 99L, 100L, 50L)), .Names = c("name", "percentage"), class = "data.frame", row.names = c(NA,
-5L))
As a note, I have updated my col=names to col=percentage in this example to more accurately represent what I am doing. In my attempt to generalise the example I used col=names and now realise that it wasn't a very good example (as you quite rightly pointed out, a 'name' is never likely to be numeric). The above problems still persist for me, however.
** Update: I managed to get it working with the following:
example_function <- function(df=df, col=col){
a <- df[df[col] == 100,]
print(a)
}
passing example_function(df=example_df, col='percentage')
The first line of the body of example_function should be
a <- df[df[[col]] == 100,]
When you break it down, df[['names']] == 100 will give you a logical vector indicating which rows of df have a names value of 100. But 'names' == 100 is nonsensical: it's always false.
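A minimal base R sketch of the corrected function, using the example data from the question. `filter_equal` and its `value` argument are names invented here for illustration. It also shows why `df$col` fails: `$` looks for a column literally named `col`.

```r
example_df <- data.frame(name = c("tom", "john", "harry", "elizabeth", "james"),
                         percentage = c(80L, 100L, 99L, 100L, 50L),
                         stringsAsFactors = FALSE)

# Hypothetical helper: keep rows where the column named by `col` equals `value`.
filter_equal <- function(df, col, value = 100) {
  # df[[col]] extracts the column whose name is stored in `col` as a vector.
  df[df[[col]] == value, ]
}

res <- filter_equal(example_df, "percentage")
print(res)

# `$` does not evaluate its argument: this looks for a column called "col".
col <- "percentage"
print(is.null(example_df$col))
```

With dplyr, the analogous form would be filter(df, .data[[col]] == 100), which also takes the column name as a string.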

Merging in R based on column and row

For a sample dataframe:
survey <- structure(list(id = 1:10, cntry = structure(c(2L, 3L, 1L, 2L,
2L, 3L, 1L, 1L, 3L, 2L), .Label = c("DE", "FR", "UK"), class = "factor"),
age.cat = structure(c(1L, 1L, 2L, 4L, 1L, 3L, 4L, 4L, 1L,
2L), .Label = c("Y_15.24", "Y_40.54", "Y_55.plus", "Y_less.15"
), class = "factor")), .Names = c("id", "cntry", "age.cat"
), class = "data.frame", row.names = c(NA, -10L))
I want to add an extra column called 'age.cat' that is populated by another dataframe:
age.cat <- structure(list(cntry = structure(c(2L, 3L, 1L), .Label = c("DE",
"FR", "UK"), class = "factor"), Y_less.15 = c(0.2, 0.2, 0.3),
Y_15.24 = c(0.2, 0.1, 0.2), Y_25.39 = c(0.2, 0.3, 0.1), Y_40.54 = c(0.3,
0.2, 0.1), Y_55.plus = c(0.1, 0.2, 0.3)), .Names = c("cntry",
"Y_less.15", "Y_15.24", "Y_25.39", "Y_40.54", "Y_55.plus"), class = "data.frame", row.names = c(NA,
-3L))
The age.cat dataframe lists the proportions of people in the three countries by age category. The proportion for the corresponding country/age category needs to be added as an additional column in the survey dataframe. Previously, when I worked with a single country, I used merge, but as I understand it that won't work here because I need to match on both a column and a row.
Does anyone have any ideas?
Using data.table, I'd do this directly as follows:
require(data.table) # v1.9.6+
dt1[dt2, ratio := unlist(mget(age.cat)), by=.EACHI, on="cntry"]
where,
dt1 = as.data.table(survey)[, age.cat := as.character(age.cat)]
dt2 = as.data.table(age.cat)
For each row in dt2, the matching rows in dt1$cntry are found corresponding to dt2$cntry (it helps to think of it like a subset operation by matching on cntry column). age.cat values for those matching rows are extracted and passed to mget() function, that looks for variables named with the values in age.cat, and finds it in dt2 (we allow for columns in dt2 to be also visible for exactly this purpose), and extracts the corresponding values. Since it returns a list, we unlist it. Those values are assigned to the column ratio by reference.
Since this avoids unnecessary materialising of intermediate data by melting/gathering, it is quite efficient. Additionally, since it adds a new column by reference while joining, it avoids another intermediate materialisation and is doubly efficient.
Personally, I find the code much more straightforward to understand as to what's going on (with sufficient base R knowledge of course), but that is of course subjective.
Slightly more detailed explanation:
The general form of data.table syntax is DT[i, j, by] which reads:
Take DT, subset rows by i, then compute j grouped by by.
The i argument in data.table, in addition to being subset operations e.g., dt1[cntry == "FR"], can also be another data.table.
Consider the expression: dt1[dt2, on="cntry"].
The first thing it does is to compute, for each row in dt2, all matching row indices in dt1 by matching on the column provided in on = "cntry". For example, for dt2$cntry == "FR", the matching row indices in dt1 are c(1,4,5,10). These row indices are internally computed using fast binary search.
Once the matching row indices are computed it looks as to whether an expression is provided in the j argument. In the above expression j is empty. Therefore it returns all the columns from both dt1 and dt2 (leading to a right join).
In other words, data.table allows join operations to be performed in a similar fashion to subsets (because in both operations, the purpose of i argument is to obtain matching rows). For example, dt1[cntry == "FR"] would first compute the matching row indices, and then extract all columns for those rows (since no columns are provided in the j argument). This has several advantages. For example, if we would only like to return a subset of columns, then we can do, for example:
dt1[dt2, .(cntry, Y_less.15), on="cntry"]
This is efficient because we look at the j expression and notice that only those two columns are required. Therefore, on the computed row indices, we extract only the required columns, avoiding unnecessary materialisation of all the other columns.
Also, just like how we can select columns, we can also compute on columns. For example, what if you'd like to get sum(Y_less.15)?
dt1[dt2, sum(Y_less.15), on="cntry"]
# [1] 2.3
This is great, but it computes the sum on all the matching rows. What if you'd like to get the sum for each row in dt2$cntry? This is where by = .EACHI comes in.
dt1[dt2, sum(Y_less.15), on="cntry", by=.EACHI]
# cntry V1
# 1: FR 0.2
# 2: UK 0.2
# 3: DE 0.3
by=.EACHI ensures that the j expression is evaluated for each row in i = dt2.
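The matching step and the effect of by=.EACHI can be checked on a small stand-in for the two tables. This is only a sketch: the tables below keep just the columns needed, and `fr_rows` and `per_country` are names chosen for the illustration.

```r
library(data.table)

# Small stand-ins for the survey and proportion tables.
dt1 <- data.table(id = 1:10,
                  cntry = c("FR", "UK", "DE", "FR", "FR", "UK", "DE", "DE", "UK", "FR"))
dt2 <- data.table(cntry = c("FR", "UK", "DE"),
                  Y_less.15 = c(0.2, 0.2, 0.3))

# Row indices of dt1 that match cntry == "FR".
fr_rows <- dt1[.("FR"), on = "cntry", which = TRUE]
print(fr_rows)

# by = .EACHI: j is evaluated once per row of dt2, here counting the matches.
per_country <- dt1[dt2, .N, by = .EACHI, on = "cntry"]
print(per_country)
```

The counts come back in the order of the rows of dt2, one result row per row of i.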
Similarly, we can also add/update columns while joining using the := operator. And that's the answer shown above. The only tricky part there is to extract the values for those matching rows from dt2, since they are stored in separate columns. Hence we use mget(). And the expression unlist(mget(.)) gets evaluated for each row in dt2 while matching on "cntry" column. And the corresponding values are assigned to ratio by using the := operator.
For more details on history of := operator see this, this and this post on SO.
For more on by=.EACHI, see this post.
For more on data.table syntax introduction and reference semantics, see the vignettes.
Hope this helps.
You can turn age.cat into long format and then use join as follows:
library(dplyr)
library(tidyr)
age.cat <- gather(age.cat, age.cat, proportion, -cntry)
inner_join(survey, age.cat)
We can do a join after melting the second dataset to 'long' format
library(data.table) #v1.9.7
melt(setDT(age.cat), id.var="cntry")[survey, on = c("cntry", "variable" = "age.cat")]
# cntry variable value id
# 1: FR Y_15.24 0.2 1
# 2: UK Y_15.24 0.1 2
# 3: DE Y_40.54 0.1 3
# 4: FR Y_less.15 0.2 4
# 5: FR Y_15.24 0.2 5
# 6: UK Y_55.plus 0.2 6
# 7: DE Y_less.15 0.3 7
# 8: DE Y_less.15 0.3 8
# 9: UK Y_15.24 0.1 9
#10: FR Y_40.54 0.3 10
If we are using the CRAN version i.e. data.table_1.9.6,
melt(setDT(age.cat), id.var="cntry", variable.name = "age.cat")[survey,
on = c("cntry", "age.cat")]
You can do this using the packages reshape2 and dplyr:
age.cat %>% melt(variable.name="age.cat") %>% left_join(survey, .)
#### id cntry age.cat value
#### 1 1 FR Y_15.24 0.2
#### 2 2 UK Y_15.24 0.1
#### 3 3 DE Y_40.54 0.1
Is that what you want?

sample data.table rows with different conditions

I have a data.table with multiple columns. One of these columns currently works as a 'key' (keyb in the example). Another column (let's say A) may or may not have data in it. Given a vector of keys, I would like to randomly sample two rows for each key that appears in the vector: one row where A contains data and one where it does not.
MRE:
#data.table
trys <- structure(list(keyb = c("x", "x", "x", "x", "x", "y", "y", "y",
"y", "y"), A = c("1", "", "1", "", "", "1", "", "", "1", "")), .Names = c("keyb",
"A"), row.names = c(NA, -10L), class = c("data.table", "data.frame"
))
setkey(trys,keyb)
#list with keys
list_try <- structure(list(a = "x", b = c("r", "y","x")), .Names = c("a", "b"))
I could, for instance subset the data.table based on the elements that appear in list_try:
trys[keyb %in% list_try[[2]]]
My original (and probably inefficient) idea was to try to chain a sample of two rows per key, where the A column has data or no data, and then merge. But it does not work:
#here I was trying to sample rows based on whether A has data or not
#here for rows where A has no data
trys[keyb %in% list_try[[2]]][nchar(A)==0][sample(.N, 2), ,by = keyb]
#here for rows where A has data
trys[keyb %in% list_try[[2]]][nchar(A)==1][sample(.N, 2), ,by = keyb]
In this case, my expected output would be two data.tables (one for a and one for b in list_try), of two rows per appearing element: So the data.table from a would have two rows (one with and without data in A), and the one from b, four rows (two with and two without data in A).
Please let me know if I can make this post any clearer
You could add A to the by statement too, converting it to a logical vector with A != "". Combined with a binary join (adding nomatch = 0L in order to remove non-matches), you can then sample from the row index .I within each of those groups and use the result to subset the original data set.
For a single subset case
trys[trys[list_try[[2]], nomatch = 0L, sample(.I, 1L), by = .(keyb, A != "")]$V1]
# keyb A
# 1: y 1
# 2: y
# 3: x 1
# 4: x
For a more general case, when you want to create separate data sets according to a list of keys, you could easily embed this into lapply
lapply(list_try,
function(x) trys[trys[x, nomatch = 0L, sample(.I, 1L), by = .(keyb, A != "")]$V1])
# $a
# keyb A
# 1: x 1
# 2: x
#
# $b
# keyb A
# 1: y 1
# 2: y
# 3: x 1
# 4: x
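One caveat with sample(.I, 1L): when a group has a single row, .I is one number n, and base R's sample then draws from 1:n rather than returning n. A possible variant (a sketch, not the answer's code) samples a position within the group instead. The column name `has_A` and the final %in% filter are choices made here for illustration.

```r
library(data.table)

trys <- data.table(keyb = rep(c("x", "y"), each = 5),
                   A = c("1", "", "1", "", "", "1", "", "", "1", ""))
keys <- c("r", "y", "x")  # "r" has no rows in trys

set.seed(42)  # only to make the sketch reproducible

# One random row index per (key, A present / A empty) group.
idx <- trys[, .I[sample.int(.N, 1L)], by = .(keyb, has_A = A != "")]$V1

# Subset the sampled rows, then keep only the requested keys.
picked <- trys[idx][keyb %in% keys]
print(picked)
```

Each requested key that exists in the table contributes one row with data in A and one without.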
