Using a function in lapply in data.table in R

Suppose there is a sample data set as below.
> tmp <- data.table(x=c(1:10),y=(5:14))
> tmp
x y
1: 1 5
2: 2 6
3: 3 7
4: 4 8
5: 5 9
6: 6 10
7: 7 11
8: 8 12
9: 9 13
10: 10 14
I want to keep the two lowest numbers in each column and change all the other values to 0, like this:
x y
1: 1 5
2: 2 6
3: 0 0
4: 0 0
5: 0 0
6: 0 0
7: 0 0
8: 0 0
9: 0 0
10: 0 0
I think the code is
tmp[, c("x","y"):=lapply(.SD, x[which(!x %in% sort(x)[1:2])] = 0}), .SDcols=c("x","y")]
but it changes all values to 0.
How can I solve this problem?

To expand on my comment, I'd do something like this:
for (j in names(tmp)) {
  col = tmp[[j]]
  min_2 = sort.int(unique(col), partial = 2L)[2L]  # 2nd lowest value
  set(tmp, i = which(col > min_2), j = j, value = 0L)
}
This loops over all the columns in tmp and gets the 2nd-lowest value for each column using sort.int() with the partial argument, which is slightly more efficient than sort() because the entire vector does not have to be sorted just to find the 2nd-lowest value.
Then set() replaces, for each column, the rows whose value is greater than that 2nd-lowest value with 0.
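To illustrate what the partial argument buys here: it only guarantees that the element at the requested position ends up in its sorted place, which is all that is needed to pick out the 2nd-lowest value (a small sketch with made-up numbers):
x <- c(10L, 3L, 7L, 1L, 5L)
sort.int(x, partial = 2L)[2L]   # 3 -- the 2nd-lowest value; the rest of x need not be fully sorted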

Maybe you can try
tmp[, lapply(.SD, function(x)
  replace(x, !rank(x, ties.method = 'first') %in% 1:2, 0))]
#      x y
#  1:  1 5
#  2:  2 6
#  3:  0 0
#  4:  0 0
#  5:  0 0
#  6:  0 0
#  7:  0 0
#  8:  0 0
#  9:  0 0
# 10:  0 0
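Note that this returns a new data.table rather than modifying tmp. If you would rather update tmp by reference, the same expression should work with := (a sketch along the same lines):
tmp[, (names(tmp)) := lapply(.SD, function(x)
  replace(x, !rank(x, ties.method = 'first') %in% 1:2, 0))]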

Related

Does anyone have a solution for the coin problem when using R?

There are four types of common coins in US currency:
quarters (25 cents)
dimes (10 cents)
nickels (5 cents), and
pennies (1 cent)
There are six ways to make change for 15 cents:
A dime and a nickel
A dime and 5 pennies
3 nickels
2 nickels and 5 pennies
A nickel and 10 pennies
15 pennies
Task:
How many ways are there to make change for a dollar using these common coins? (1 dollar = 100 cents).
tl;dr
There are 242 ways to make 1 dollar out of an unlimited supply of 1, 5, 10 and 25 cent pieces.
code
Here is a go at it using the comboGeneral() function from the RcppAlgos package.
Just set sum_constraint to the sum you want the coin values to add up to.
library(RcppAlgos)
library(data.table)

# possible coin values
vec <- c(1, 5, 10, 25)
# desired sum
sum_constraint <- 15

l <- lapply(1:(sum_constraint / min(vec)), function(x) {
  # calculate possible combinations (output = matrix)
  temp <- comboGeneral(vec,
                       m = x,
                       repetition = TRUE,
                       constraintFun = "sum",
                       comparisonFun = "==",
                       limitConstraints = sum_constraint)
  # create a row-wise frequency table of the freshly created matrix,
  # and convert the table to a data.frame
  as.data.frame.matrix(table(c(row(temp)), c(temp)))
})

# bind the list together into one data.table
answer <- rbindlist(l, idcol = "no_coins", use.names = TRUE, fill = TRUE)
# set missing values to 0
answer[is.na(answer)] <- 0
# output
answer
sum_constraint = 15
#    no_coins  5 10  1
# 1:        2  1  1  0
# 2:        3  3  0  0
# 3:        6  0  1  5
# 4:        7  2  0  5
# 5:       11  1  0 10
# 6:       15  0  0 15
sum_constraint = 100
#      no_coins 25  5 10   1
#   1:        4  4  0  0   0
#   2:        6  3  1  2   0
#   3:        7  3  3  1   0
#   4:        7  2  0  5   0
#   5:        8  3  5  0   0
#  ---
# 238:       88  0  3  0  85
# 239:       91  0  0  1  90
# 240:       92  0  2  0  90
# 241:       96  0  1  0  95
# 242:      100  0  0  0 100
#      no_coins 25  5 10   1
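To reproduce the one-dollar figure, set sum_constraint <- 100 before running the code above and count the rows of the result:
nrow(answer)
# [1] 242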

Why does the ifelse function with AND statements give the wrong results?

I have the following dataset:
df1 <- data.frame(number = c(1,1,0,0,0,0,0,1,1))
In this dataset I want to create a second column which shows, for each row of the first column, whether the first and second lags equal 0 and the first lead equals 1. If this is the case, the number 1 is put in the second column in the row where the change from 0 to 1 occurred; if not, the value is 44. As a result, all rows of the second column in this output should equal 44 except the 8th.
Here is my code; in the comments below I will put a photo of the required result.
df1$t <- ifelse(df1[,1]==1 & lag(df1[,1]==0,1,default = 44) & lag(df1[,1]==0,2,default = 44)
                & lead(df1[,1]==1,1,default = 44),
                1, 44)
Although the OP has asked for an explanation why his code does not return the expected result (which is addressed by Gregor's comment), I would like to propose an alternative approach.
If I understand correctly, the OP wants to find all sub-sequences in df1$number which consist of two zeros followed by two ones, i.e., c(0, 0, 1, 1). Then, the row which contains the first one in the sub-sequence should be marked by a 1 while all other rows should get 44 as default value.
As of version v1.12.0 (on CRAN 13 Jan 2019) of data.table, the shift() function recognizes negative lag/lead parameters. By this, a column can be shifted by multiple values in one batch. The row numbers which fulfill above condition are identified by a subsequent join operation. Finally df1 is updated selectively using these row numbers:
# use enhanced sample dataset, rows 10 to 21 added
df1 <- data.frame(number = c(1,1,0,0,0,0,0,1,1,0,1,0,1,1,0,0,1,0,0,1,1))
library(data.table)
setDT(df1)[, t := 44] # coerce to data.table, pre-populate result column
# shift and join
idx <- df1[, shift(number, 2:-1)][.(0, 0, 1, 1), on = paste0("V", 1:4), which = TRUE]
df1[idx, t := 1] # selective update
df1
number t
1: 1 44
2: 1 44
3: 0 44
4: 0 44
5: 0 44
6: 0 44
7: 0 44
8: 1 1
9: 1 44
10: 0 44
11: 1 44
12: 0 44
13: 1 44
14: 1 44
15: 0 44
16: 0 44
17: 1 44
18: 0 44
19: 0 44
20: 1 1
21: 1 44
number t
This works essentially like the OP's approach, by shifting and comparing with expected values. However, the OP's approach requires coding four comparisons and three shift operations, while here the shifting is done in one step and the comparison of all columns is done simultaneously by the join operation in the second step.
Additional explanations
The shift operation
df1[, shift(number, 2:-1)]
returns
V1 V2 V3 V4
1: NA NA 1 1
2: NA 1 1 0
3: 1 1 0 0
4: 1 0 0 0
5: 0 0 0 0
6: 0 0 0 0
7: 0 0 0 1
8: 0 0 1 1
9: 0 1 1 0
10: 1 1 0 1
11: 1 0 1 0
12: 0 1 0 1
13: 1 0 1 1
14: 0 1 1 0
15: 1 1 0 0
16: 1 0 0 1
17: 0 0 1 0
18: 0 1 0 0
19: 1 0 0 1
20: 0 0 1 1
21: 0 1 1 NA
V1 V2 V3 V4
In the subsequent join operation,
df1[, shift(number, 2:-1)][.(0, 0, 1, 1), on = paste0("V", 1:4), which = TRUE]
which = TRUE asks to return only the indices of the matching rows, which are
[1] 8 20
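For completeness, the OP's lag()/lead() approach can also be made to work by shifting the column itself rather than the logical comparison, and by picking defaults that make the condition fail at the boundaries (a sketch assuming dplyr's lag() and lead()):
library(dplyr)
df1$t <- ifelse(df1$number == 1 &
                  lag(df1$number, 1, default = 1) == 0 &   # 1st lag must be 0
                  lag(df1$number, 2, default = 1) == 0 &   # 2nd lag must be 0
                  lead(df1$number, 1, default = 0) == 1,   # 1st lead must be 1
                1, 44)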

dcast in r with duplicates and no aggregation

I have looked through many similar questions here but can't find an answer that addresses this situation. My data frame is like this:
SET SP    T1  T2  T3
A   dog    1   0   0
A   cat    0  NA   4
A   bird   5   0  NA
B   cat    2   0   0
B   bird  NA   3   0
C   dog    1   0   0
C   cat    0   0   6
C   bird   0   0   0
D   dog   NA  22   1
Where SET is purposefully duplicated many times, with each record including a single SP and values for multiple TRIALS (T1-3).
What I desire is a wide dataframe like the following. There is to be NO summation/averaging/mathematical operation of any kind:
SET DOG_T1 DOG_T2 DOG_T3 CAT_T1 CAT_T2 CAT_T3 BIRD_T1 BIRD_T2 BIRD_T3
142      1      0      0      0     NA      4       5       0      NA
255     NA     NA     NA      2      0      0      NA       3       0
336      1      0      0      0      0      6       0       0       0
 66     NA     22      1     NA     NA     NA      NA      NA      NA
I have tried the following, but melt and dcast fall back to aggregating with length. This turns the SET variable into different numbers and only fills in 0s and 1s for the value.
df %>%
  group_by(SET, SP) %>%
  melt(id.vars = c('SET','SP')) %>%
  data.table::dcast(SP + variable ~ SET, fun.aggregate = NULL, value.var = 'value')
This works when I DON'T have any duplicate SETs, but fails as soon as I include the full dataset.
A note: my real data frame is about 2.5 million rows, so speed is of concern.
This is a situation where dcast.data.table excels. It allows multiple columns to be passed as value.var, allowing for very concise syntax:
library(data.table)
dcast(df, SET ~ SP, value.var=c("T1", "T2", "T3"))
#    SET T1_bird T1_cat T1_dog T2_bird T2_cat T2_dog T3_bird T3_cat T3_dog
# 1:   A       5      0      1       0     NA      0      NA      4      0
# 2:   B      NA      2     NA       3      0     NA       0      0     NA
# 3:   C       0      0      1       0      0      0       0      6      0
# 4:   D      NA     NA     NA      NA     NA     22      NA     NA      1
As per @lukeA above, but add a fun.aggregate = identity or fun.aggregate = list argument to the dcast() call.
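A rough sketch of that suggestion (assuming df has been converted with setDT() and that every SET/SP/trial combination holds at most one value, so identity never receives more than one element; fill = NA covers the combinations that are absent):
library(data.table)
setDT(df)
long <- melt(df, id.vars = c("SET", "SP"))
dcast(long, SET ~ SP + variable,
      value.var = "value", fun.aggregate = identity, fill = NA)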
You could try
library(tidyverse)
df <- read.table(header=T, text="
SET SP T1 T2 T3
A dog 1 0 0
A cat 0 NA 4
A bird 5 0 NA
B cat 2 0 0
B bird NA 3 0
C dog 1 0 0
C cat 0 0 6
C bird 0 0 0
D dog NA 22 1")
df %>%
  gather(var, val, -(1:2)) %>%
  unite("SP", SP, var) %>%
  spread(SP, val)
#   SET bird_T1 bird_T2 bird_T3 cat_T1 cat_T2 cat_T3 dog_T1 dog_T2 dog_T3
# 1   A       5       0      NA      0     NA      4      1      0      0
# 2   B      NA       3       0      2      0      0     NA     NA     NA
# 3   C       0       0       0      0      0      6      1      0      0
# 4   D      NA      NA      NA     NA     NA     NA     NA     22      1
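As an aside, with tidyr >= 1.0 the same reshape can be written with pivot_wider(), which accepts several value columns at once (the columns are then named in the T1_dog style rather than dog_T1):
library(tidyr)
pivot_wider(df, id_cols = SET, names_from = SP, values_from = c(T1, T2, T3))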
Since my current reputation doesn't allow adding a comment to @lukeA's answer above, I am making this a new answer, which is more of a suggestion:
Using the data.table function setcolorder, one could reorder the columns as "bird_T1, cat_T1, dog_T1, bird_T2, cat_T2 etc." by using a custom function similar to
newOrder <- function() {
  lapply(1:max(index),
         function(i) grep(sprintf('%s', i), names(DT), value = TRUE))
}
where index is the index created in order to allow dcast.data.table to handle duplicated keys such as the SET variable in the initial table above:
DT[, index := 1:.N, by = SET]
Finally, the new order is achieved in the regular manner:
setcolorder(dcast(DT), neworder = newOrder())
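A concrete version of this idea, applied to the bird_T1, cat_T1, ... column names produced by the gather/unite/spread answer above (a sketch assuming three trials, T1 to T3):
library(data.table)
wide <- as.data.table(df %>% gather(var, val, -(1:2)) %>% unite("SP", SP, var) %>% spread(SP, val))
# for each trial number, collect the matching column names: bird_T1, cat_T1, dog_T1, bird_T2, ...
new_order <- unlist(lapply(1:3, function(i) grep(paste0("T", i), names(wide), value = TRUE)))
setcolorder(wide, c("SET", new_order))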

R add columns by loop in data table

I have a data table like this:
DT <- data.table(ID=rep(c(1:2),each=6), year=rep(c(2003:2006),each=3), month=rep(c(5:8),3), day=rep(c(11:14),3),value=c(101:112))
And I would like to add columns under these conditions:
1. add 5 columns with the names V100, V102, V105, V108, V112;
2. in each column, grouped by ID and year, count the values less than or equal to the value in the column name, e.g. for column V112, count the grouped values less than or equal to 112.
So the outcome will look like:
DT1 <- data.table(ID=rep(c(1:2),each=2), year=c(2003:2006), "100"=rep(0,4), "102"=c(2,0,0,0),"105"=c(3,2,0,0),"108"=c(3,3,2,0),"112"=rep(3,4))
I tried to write the code but couldn't figure it out:
degree <- c(100, 102, 105, 108, 112)
for (d in degree) {
  f_year <- function(d) { sum(DT$value <= d) }
  DT <- DT[, d := f_year(), by = list(ID, year)]
}
Any help would be appreciated!
That's what lapply can be used for.
degree <- c(100, 102, 105, 108, 112)
myfun <- function(x,y) sum(y <= x)
DT1 <- DT[, lapply(degree, myfun, value), by = .(ID, year)]
setnames(DT1, c("ID", "year", as.character(degree)))
Result:
> DT1
ID year 100 102 105 108 112
1: 1 2003 0 2 3 3 3
2: 1 2004 0 0 2 3 3
3: 2 2005 0 0 0 2 3
4: 2 2006 0 0 0 0 3
Just another way:
cols = c(100,102,105,108,112)
DT[, lapply(cols, function(x) sum(value <= x)), by=.(ID, year)]
# ID year V1 V2 V3 V4 V5
# 1: 1 2003 0 2 3 3 3
# 2: 1 2004 0 0 2 3 3
# 3: 2 2005 0 0 0 2 3
# 4: 2 2006 0 0 0 0 3
Then you can set the names.
Instead, if you'd like to set the names directly, you can create a named list first:
named_cols = setattr(as.list(cols), 'names', cols)
DT[, lapply(named_cols, function(x) sum(value<=x)), by=.(ID, year)]
# ID year 100 102 105 108 112
# 1: 1 2003 0 2 3 3 3
# 2: 1 2004 0 0 2 3 3
# 3: 2 2005 0 0 0 2 3
# 4: 2 2006 0 0 0 0 3

interchanging values after comparing two columns in R

I want to write code that checks and compares two columns in a data frame. One is supposed to hold lower limits and the other upper limits. If a value in the upper-limit column is less than the one in the lower-limit column, the code should interchange the two values. If both the lower and upper limits are zero, it should replace the upper-limit value with some value, say 2. A sample data set is below:
lower_limit upper_limit
0 3
0 4
5 2
0 15
0 0
0 0
7 4
8 2
After running the code, it should produce something like:
lower_limit upper_limit
0 3
0 4
2 5
0 15
0 2
0 2
4 7
2 8
dfrm <- read.table(text="lower_limit upper_limit
0 3
0 4
5 2
0 15
0 0
0 0
7 4
8 2", header=TRUE)
dfrm2 <- dfrm
dfrm2[, 2] <- pmax(dfrm[, 1], dfrm[, 2])   # upper limit: row-wise maximum
dfrm2[, 1] <- pmin(dfrm[, 1], dfrm[, 2])   # lower limit: row-wise minimum
dfrm2[abs(pmax(dfrm[, 1], dfrm[, 2])) == 0, 2] <- 2   # both limits zero -> upper limit becomes 2
> dfrm2
lower_limit upper_limit
1 0 3
2 0 4
3 2 5
4 0 15
5 0 2
6 0 2
7 4 7
8 2 8
Assuming dat is the name of your data frame/matrix:
setNames(as.data.frame(t(apply(dat, 1, function(x) {
  tmp <- sort(x)
  tmp[2] <- tmp[2] + (!any(x)) * 2
  return(tmp)
}))), colnames(dat))
lower_limit upper_limit
1 0 3
2 0 4
3 2 5
4 0 15
5 0 2
6 0 2
7 4 7
8 2 8
How does it work?
The function apply is used to apply the function to each row (MARGIN = 1). In this function, x represents a row of dat. First, the values are ordered (with sort) and stored in the object tmp. Then, the second value of tmp is replaced with 2 if both values are 0. Finally, tmp is returned. apply returns the results as a matrix, which needs to be transposed (with t). This matrix is converted to a data frame (as.data.frame) with the same column names as the original object dat (with setNames).
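The same logic can also be written with data.table, updating the table by reference (a sketch assuming the data is stored in dfrm as above):
library(data.table)
setDT(dfrm)
# swap so that the smaller value always ends up in lower_limit
dfrm[, c("lower_limit", "upper_limit") := .(pmin(lower_limit, upper_limit),
                                             pmax(lower_limit, upper_limit))]
# if both limits are zero, set the upper limit to 2
dfrm[lower_limit == 0 & upper_limit == 0, upper_limit := 2L]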
