Adding a second column as a function of first with data.table's (`:=`) [duplicate] - r

I want to create a new data.table, or maybe just add some columns to an existing one. It is easy to specify multiple new columns, but what if I want a third column to be calculated from one of the columns I am creating in the same call? I think the plyr package can do something like that. Can we perform such iterative (sequential) column creation in data.table?
I want to do as follows
dt <- data.table(shop = 1:10, income = 10:19*70)
dt[ , list(hope = income * 1.05, hopemore = income * 1.20, hopemorerealistic = hopemore - 100)]
or maybe
dt[ , `:=`(hope = income*1.05, hopemore = income*1.20, hopemorerealistic = hopemore-100)]

You can also use <- within the call to list, e.g.
DT <- data.table(a=1:5)
DT[, c('b','d') := list(b1 <- a*2, b1*3)]
DT
a b d
1: 1 2 6
2: 2 4 12
3: 3 6 18
4: 4 8 24
5: 5 10 30
Or
DT[, `:=`(hope = hope <- a+1, z = hope-1)]
DT
a b d hope z
1: 1 2 6 2 1
2: 2 4 12 3 2
3: 3 6 18 4 3
4: 4 8 24 5 4
5: 5 10 30 6 5

It is possible by using curly braces and semicolons in j.
There are multiple ways to go about it; here are two examples:
# If you simply want to output:
dt[ ,
{hope=income*1.05;
hopemore=income*1.20;
list(hope=hope, hopemore=hopemore, hopemorerealistic=hopemore-100)}
]
# if you want to save the values
dt[ , c("hope", "hopemore", "hopemorerealistic") :=
{hope=income*1.05;
hopemore=income*1.20;
list(hope, hopemore, hopemore-100)}
]
dt
# shop income hope hopemore hopemorerealistic
# 1: 1 700 735.0 840 740
# 2: 2 770 808.5 924 824
# 3: 3 840 882.0 1008 908
# 4: 4 910 955.5 1092 992
# 5: 5 980 1029.0 1176 1076
# 6: 6 1050 1102.5 1260 1160
# 7: 7 1120 1176.0 1344 1244
# 8: 8 1190 1249.5 1428 1328
# 9: 9 1260 1323.0 1512 1412
# 10: 10 1330 1396.5 1596 1496
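Another option, offered as a minimal sketch rather than as part of the answers above, is to chain separate `:=` calls, so that each later call can see the columns added by the earlier ones. It assumes the data.table package is installed and reuses the question's `dt`.

```r
library(data.table)

dt <- data.table(shop = 1:10, income = 10:19 * 70)

# Each [ ... ] runs after the previous one has finished, so `hopemore`
# already exists as a column when the last call refers to it.
dt[, hope := income * 1.05][
   , hopemore := income * 1.20][
   , hopemorerealistic := hopemore - 100]

head(dt, 3)
```

This is more verbose than a single call, but it avoids assignments inside `j`.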

Related

Why does the frequency reduce if I use ifelse function in R? Is there a way to create categories from the combination of 2 variables/columns?

When I do
table(df$strategy.x)
0 1 2 3
70 514 223 209
table(df$strategy.y)
0 1 2 3
729 24 7 4
I want to create a variable with both of these combined. I tried this
df <- df %>%
mutate(nstrategy1 = ifelse(strategy.x==1| strategy.y==1 , 1, 0))
table(df$nstrategy1)
0 1
399 519
I am supposed to get 514 + 24 = 538 but I got 519 instead
df <- df %>% mutate(nstrategy2 = ifelse(strategy.x==2| strategy.y==2 , 1, 0))
table(df$nstrategy2)
0 1
578 228
Similarly, I am supposed to get 223 + 7 = 230, but I got 228 instead
Is there a good way to merge both strategy.x and strategy.y and end up with a table like the following with 4 categories?
0 1 2 3
799 538 230 213
table(mtcars$am) # 13 1's
table(mtcars$vs) # 14 1's
mtcars$ones = ifelse(mtcars$am == 1 | mtcars$vs == 1, 1, 0)
table(mtcars$ones) # 20 1's < 13 + 14 = 27
Why is it showing only 20 1's instead of 27? It's because there are 7 + 6 + 7 = 20 cars with either one or two 1's in am and vs. There are 13 with am==1 (6+7), and 14 with vs==1 (7+7). In the cross-tabulation below (rows are am, columns are vs), seven cars are in the bottom right because they have 1's in both dimensions, and those are the cars you are expecting to count twice.
table(mtcars$am, mtcars$vs)
# 0 1
# 0 12 7
# 1 6 7
The simplest way to get the sum of the two results would be by adding the two table objects:
table(mtcars$am) + table(mtcars$vs)
# 0 1
# 37 27
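The arithmetic behind the mismatch can be checked directly. The following is a small sketch on the built-in mtcars data; pooling the two columns with `c()` is one assumed way to get a single combined table, and the same idea would apply to `strategy.x` and `strategy.y` in the question.

```r
am <- mtcars$am
vs <- mtcars$vs

n_am   <- sum(am == 1)            # cars with am == 1
n_vs   <- sum(vs == 1)            # cars with vs == 1
n_both <- sum(am == 1 & vs == 1)  # cars that satisfy both conditions
n_any  <- sum(am == 1 | vs == 1)  # what the ifelse() version counts

# Inclusion-exclusion: `|` counts a row matching both conditions only once
stopifnot(n_any == n_am + n_vs - n_both)

# Pooling both columns into one vector counts every value, overlaps included
pooled <- table(c(am, vs))
pooled
```

For the question's data, `table(c(df$strategy.x, df$strategy.y))` should give the four-category table that was asked for, assuming both columns use the same 0 to 3 coding.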

how to select data based on a list from a split data frame and then recombine in R

I am trying to do the following. I have a dataset Test:
Item_ID Test_No Category Sharpness Weight Viscocity
132 1 3 14.93199362 94.37250417 579.4236727
676 1 4 44.58750591 70.03232054 1829.170727
699 2 5 89.02760079 54.30587287 1169.226863
850 3 6 30.74535903 83.84377678 707.2280513
951 4 237 67.79568019 51.10388484 917.6609965
1031 5 56 74.06697003 63.31274502 1981.17804
1175 4 354 98.9656142 97.7523884 100.7357981
1483 5 726 9.958040999 51.29537311 1222.910211
1529 7 800 64.11430235 65.69780939 573.8266137
1698 9 125 67.83105185 96.53847341 486.9620194
1748 9 1005 49.43602318 52.9139591 1881.740184
2005 9 28 26.89821508 82.12663209 1709.556135
2111 2 76 83.03593144 85.23622731 276.5088502
I want to split this data by Test_No and then compute the number of unique Category values per Test_No, as well as the median Category value. I chose to use split and sapply in the following way, but I am getting an error about a missing parenthesis. Is there anything wrong with my approach? Please find my code below:
function(CatRange){
c(Cat_Count = length(unique(CatRange$Category)), Median_Cat = median(unique(CatRange$Category), na.rm = TRUE) )
}
CatStat <- do.call(rbind,sapply(split(Test, Test$Test_No), function(ModRange)))
Appending my question:
I would want to display the data containing the following information:
Test_No, Category, Median_Cat and Cat_Count
We can try with dplyr
library(dplyr)
Test %>%
group_by(Test_No) %>%
summarise(Cat_Count = n_distinct(Category),
Median_Cat = median(Category,na.rm = TRUE),
Category = toString(Category))
# Test_No Cat_Count Median_Cat Category
# <int> <int> <dbl> <chr>
#1 1 2 3.5 3, 4
#2 2 2 40.5 5, 76
#3 3 1 6.0 6
#4 4 2 295.5 237, 354
#5 5 2 391.0 56, 726
#6 7 1 800.0 800
#7 9 3 125.0 125, 1005, 28
Or if you prefer base R we can also try with aggregate
aggregate(Category~Test_No, Test, function(x) c(Cat_Count = length(unique(x)),
Median_Cat = median(x,na.rm = TRUE), Category = toString(x)))
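As far as the function you wrote is concerned, I think there are some syntax issues in it: the function is never assigned to a name, and the sapply call passes function(ModRange) without a body. A corrected version: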
new_func <- function(CatRange){
c(Cat_Count = length(unique(CatRange$Category)),
Median_Cat = median(unique(CatRange$Category), na.rm = TRUE),
Category = toString(CatRange$Category))
}
data.frame(t(sapply(split(Test, Test$Test_No), new_func)))
# Cat_Count Median_Cat Category
#1 2 3.5 3, 4
#2 2 40.5 5, 76
#3 1 6 6
#4 2 295.5 237, 354
#5 2 391 56, 726
#7 1 800 800
#9 3 125 125, 1005, 28
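For the appended request (showing Test_No, Category, Median_Cat and Cat_Count together), one possible base R sketch keeps one row per original observation by using `ave()`. The data frame below copies only the two relevant columns from the question; this is an illustration under that assumption, not the method used in the answers above.

```r
# Only the two columns needed here, copied from the question's data
Test <- data.frame(
  Test_No  = c(1, 1, 2, 3, 4, 5, 4, 5, 7, 9, 9, 9, 2),
  Category = c(3, 4, 5, 6, 237, 56, 354, 726, 800, 125, 1005, 28, 76)
)

# ave() applies FUN within each Test_No group and returns one value per row
Test$Cat_Count  <- ave(Test$Category, Test$Test_No,
                       FUN = function(x) length(unique(x)))
Test$Median_Cat <- ave(Test$Category, Test$Test_No, FUN = median)

# Sort by Test_No so rows of the same test sit together
Test[order(Test$Test_No), ]
```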

R: sum rows from column A until conditioned value in column B

I'm pretty new to R and can't seem to figure out how to deal with what seems to be a relatively simple problem. I want to sum the rows of the column 'DURATION' per 'TRIAL_INDEX', but only the first rows where the values of 'X_POSITION' are increasing. That is, I only want to sum the first run within a trial where X increases.
The first rows of a simplified dataframe:
TRIAL_INDEX DURATION X_POSITION
1 1 204 314.5
2 1 172 471.6
3 1 186 570.4
4 1 670 539.5
5 1 186 503.6
6 2 134 306.8
7 2 182 503.3
8 2 806 555.7
9 2 323 490.0
So, for TRIAL_INDEX 1, only the first three values of DURATION should be added (204+172+186), as this is where X has the highest value so far (going through the dataframe row by row).
The desired output should look something like:
TRIAL_INDEX DURATION X_POSITION FIRST_PASS_TIME
1 1 204 314.5 562
2 1 172 471.6 562
3 1 186 570.4 562
4 1 670 539.5 562
5 1 186 503.6 562
6 2 134 306.8 1122
7 2 182 503.3 1122
8 2 806 555.7 1122
9 2 323 490.0 1122
I tried to use dplyr, to generate a new dataframe that can be merged with my original dataframe.
However, the code doesn't work, and also I'm not sure on how to make sure it's only adding the first rows per trial that have increasing values for X_POSITION.
FirstPassRT = dat %>%
group_by(TRIAL_INDEX) %>%
filter(dplyr::lag(dat$X_POSITION,1) > dat$X_POSITION) %>%
summarise(FIRST_PASS_TIME=sum(DURATION))
Any help and suggestions are greatly appreciated!
library(data.table)
dt = as.data.table(dat) # or setDT to convert in place
# find the rows that will be used for summing DURATION
idx = dt[, .I[1]:.I[min(.N, which(diff(X_POSITION) < 0), na.rm = TRUE)], by = TRIAL_INDEX]$V1
# sum the DURATION for those rows
dt[idx, time := sum(DURATION), by = TRIAL_INDEX][, time := time[1], by = TRIAL_INDEX]
dt
# TRIAL_INDEX DURATION X_POSITION time
#1: 1 204 314.5 562
#2: 1 172 471.6 562
#3: 1 186 570.4 562
#4: 1 670 539.5 562
#5: 1 186 503.6 562
#6: 2 134 306.8 1122
#7: 2 182 503.3 1122
#8: 2 806 555.7 1122
#9: 2 323 490.0 1122
Here is something you can try with dplyr package:
library(dplyr)
dat %>% group_by(TRIAL_INDEX) %>%
mutate(IncLogic = X_POSITION > lag(X_POSITION, default = 0)) %>%
mutate(FIRST_PASS_TIME = sum(DURATION[IncLogic])) %>%
select(-IncLogic)
Source: local data frame [9 x 4]
Groups: TRIAL_INDEX [2]
TRIAL_INDEX DURATION X_POSITION FIRST_PASS_TIME
(int) (int) (dbl) (int)
1 1 204 314.5 562
2 1 172 471.6 562
3 1 186 570.4 562
4 1 670 539.5 562
5 1 186 503.6 562
6 2 134 306.8 1122
7 2 182 503.3 1122
8 2 806 555.7 1122
9 2 323 490.0 1122
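Note that flagging rows with `X_POSITION > lag(X_POSITION)` picks up every increase, so it matches the desired output only as long as X never rises again after its first drop. Below is a minimal base R sketch that stops at the first decrease. The third trial is hypothetical, added to show the difference, and `first_pass` is an illustrative helper name rather than anything from the answers here.

```r
dat <- data.frame(
  TRIAL_INDEX = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  DURATION    = c(204, 172, 186, 670, 186, 134, 182, 806, 323, 10, 20, 30, 40),
  X_POSITION  = c(314.5, 471.6, 570.4, 539.5, 503.6,
                  306.8, 503.3, 555.7, 490.0,
                  100, 200, 150, 300)   # trial 3 is hypothetical
)

# Sum DURATION over the rows that come before the first decrease in x
first_pass <- function(duration, x) {
  in_first_run <- cumsum(c(FALSE, diff(x) < 0)) == 0
  sum(duration[in_first_run])
}

# ave() hands each trial's row numbers to FUN and recycles the single sum
dat$FIRST_PASS_TIME <- ave(
  seq_len(nrow(dat)), dat$TRIAL_INDEX,
  FUN = function(i) first_pass(dat$DURATION[i], dat$X_POSITION[i])
)

unique(dat[, c("TRIAL_INDEX", "FIRST_PASS_TIME")])
```

In the hypothetical trial 3, only the first two rows (10 + 20) are summed, whereas summing every row where X increased would also include the last row.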
If you want to summarize it down to one row per trial you can use summarize like this:
library(dplyr)
df <- data_frame(TRIAL_INDEX = c(1,1,1,1,1,2,2,2,2),
DURATION = c(204,172,186,670, 186,134,182,806, 323),
X_POSITION = c(314.5, 471.6, 570.4, 539.5, 503.6, 306.8, 503.3, 555.7, 490.0))
res <- df %>%
group_by(TRIAL_INDEX) %>%
mutate(x.increasing = ifelse(X_POSITION > lag(X_POSITION), TRUE, FALSE),
x.increasing = ifelse(is.na(x.increasing), TRUE, x.increasing)) %>%
filter(x.increasing == TRUE) %>%
summarize(FIRST_PASS_TIME = sum(DURATION)) # sum DURATION, not X_POSITION
res
#Source: local data frame [2 x 2]
#
# TRIAL_INDEX FIRST_PASS_TIME
# (dbl) (dbl)
#1 1 562
#2 2 1122

Extract specific numbers from string in R

I have this example:
> exemplo
V1 V2
local::/raiz/diretorio/adminadmin/ 1
local::/raiz/diretorio/jatai_p_user/ 2
local::/raiz/diretorio/adminteste/ 3
local::/raiz/diretorio/adminteste2/ 4
local::/raiz/diretorio/48808032191/ 5
local::/raiz/diretorio/85236250110/ 6
local::/raiz/diretorio/92564593100/ 7
local::/raiz/diretorio/AACB/036/03643936451/ 331
home::22723200159 3894
home::98476963300 3895
home::15239136149 3896
home::01534562567 3897
I would like to extract only the numbers that are exactly 11 characters long (in the first column), producing a result like this one:
> exemplo
V1 V2
48808032191 5
85236250110 6
92564593100 7
03643936451 331
22723200159 3894
98476963300 3895
15239136149 3896
01534562567 3897
Any help would be great :-)
Here's one way using stringr, where d is your dataframe:
library(stringr)
m <- str_extract(d$V1, '\\d{11}')
na.omit(data.frame(V1=m, V2=d$V2))
# V1 V2
# 5 48808032191 5
# 6 85236250110 6
# 7 92564593100 7
# 8 03643936451 331
# 9 22723200159 3894
# 10 98476963300 3895
# 11 15239136149 3896
# 12 01534562567 3897
The approach above will match any run of at least 11 digits (returning its first 11). In response to @JoshO'Brien's comment, if you only want to match exactly 11 digits, you can use lookarounds:
m <- str_extract(d$V1, '(?<!\\d)\\d{11}(?!\\d)')
DF <- read.table(text = "V1 V2
local::/raiz/diretorio/adminadmin/ 1
local::/raiz/diretorio/jatai_p_user/ 2
local::/raiz/diretorio/adminteste/ 3
local::/raiz/diretorio/adminteste2/ 4
local::/raiz/diretorio/48808032191/ 5
local::/raiz/diretorio/85236250110/ 6
local::/raiz/diretorio/92564593100/ 7
local::/raiz/diretorio/AACB/036/03643936451/ 331
home::22723200159 3894
home::98476963300 3895
home::15239136149 3896
home::01534562567 3897", header = TRUE)
pattern <- "\\d{11}"
m <- regexpr(pattern, DF$V1)
DF1 <- DF[attr(m, "match.length") > -1,]
DF1$V1<- regmatches(DF$V1, m)
# V1 V2
#5 48808032191 5
#6 85236250110 6
#7 92564593100 7
#8 03643936451 331
#9 22723200159 3894
#10 98476963300 3895
#11 15239136149 3896
#12 01534562567 3897
Here's how I'd approach it. This can be done in base R, but stringi's consistent naming makes it easy to use, not to mention that it's fast. I'd store the 11 digits as a new column rather than overwrite the old one.
dat <- read.table(text="V1 V2
local::/raiz/diretorio/adminadmin/ 1
local::/raiz/diretorio/jatai_p_user/ 2
local::/raiz/diretorio/adminteste/ 3
local::/raiz/diretorio/adminteste2/ 4
local::/raiz/diretorio/48808032191/ 5
local::/raiz/diretorio/85236250110/ 6
local::/raiz/diretorio/92564593100/ 7
local::/raiz/diretorio/AACB/036/03643936451/ 331
home::22723200159 3894
home::98476963300 3895
home::15239136149 3896
home::01534562567 3897", header=TRUE)
library(stringi)
# stri_extract_first_regex returns exactly one value (or NA) per row
dat[["V3"]] <- stri_extract_first_regex(dat[["V1"]], "\\d{11}")
dat[!is.na(dat[["V3"]]), 3:2]
## V3 V2
## 5 48808032191 5
## 6 85236250110 6
## 7 92564593100 7
## 8 03643936451 331
## 9 22723200159 3894
## 10 98476963300 3895
## 11 15239136149 3896
## 12 01534562567 3897
The command you are looking for is grep(). The pattern to use would be something like \d{11} or [0-9]{11} (inside an R string the backslash must be doubled, i.e. "\\d{11}").
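Since grep() on its own only tells you which elements match, extracting the digits in base R also needs `regexpr()` and `regmatches()`. A minimal sketch follows, using a few paths from the question plus one hypothetical 12-digit entry to show that the lookarounds reject longer runs:

```r
v1 <- c("local::/raiz/diretorio/adminadmin/",
        "local::/raiz/diretorio/48808032191/",
        "local::/raiz/diretorio/AACB/036/03643936451/",
        "home::22723200159",
        "home::123456789012")  # hypothetical: 12 digits, should not match

# Exactly 11 digits: not preceded and not followed by another digit
pattern <- "(?<![0-9])[0-9]{11}(?![0-9])"

m   <- regexpr(pattern, v1, perl = TRUE)  # -1 where there is no match
hit <- m > 0                              # which elements contain a match
ids <- regmatches(v1, m)                  # the matched text, matches only

data.frame(V1 = ids, row = which(hit))
```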

Filter rows based on values of multiple columns in R

Here is the data set, say name is DS.
Abc Def Ghi
1 41 190 67
2 36 118 72
3 12 149 74
4 18 313 62
5 NA NA 56
6 28 NA 66
7 23 299 65
8 19 99 59
9 8 19 61
10 NA 194 69
How can I get a new dataset DSS where the value of column Abc is greater than 25 and the value of column Def is greater than 100? It should also ignore any row where the value of at least one column is NA.
I have tried few options but wasn't successful. Your help is appreciated.
There are multiple ways of doing it. I have given 5 methods, and the first 4 methods are faster than the subset function.
R Code:
# Method 1: logical indexing leaves NA rows behind; na.omit() drops them
DS_Filtered <- na.omit(DS[(DS$Abc > 25 & DS$Def > 100), ])
# Method 2: which function also ignores NA
DS_Filtered <- DS[ which( DS$Abc > 25 & DS$Def > 100) , ]
# Method 3:
DS_Filtered <- na.omit(DS[(DS$Abc > 25) & (DS$Def > 100), ])
# Method 4: using dplyr package; filter() drops rows where the condition is NA
library(dplyr)
DS_Filtered <- filter(DS, Abc > 25, Def > 100)
DS_Filtered <- DS %>% filter(Abc > 25 & Def > 100)
# Method 5: Subset function by default ignores NA
DS_Filtered <- subset(DS, Abc > 25 & Def > 100)
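To see the NA handling concretely, here is a self-contained base R sketch that rebuilds the question's data and applies the thresholds from the question (Abc > 25, Def > 100). Using `complete.cases()` is one assumed way to make the "ignore rows with any NA" requirement explicit.

```r
DS <- data.frame(
  Abc = c(41, 36, 12, 18, NA, 28, 23, 19, 8, NA),
  Def = c(190, 118, 149, 313, NA, NA, 299, 99, 19, 194),
  Ghi = c(67, 72, 74, 62, 56, 66, 65, 59, 61, 69)
)

# complete.cases() is FALSE for any row with an NA, and FALSE & NA is FALSE,
# so rows with missing values can never be selected
keep <- complete.cases(DS) & DS$Abc > 25 & DS$Def > 100
DSS  <- DS[keep, ]
DSS
```

Only rows 1 and 2 remain: row 6 has Abc = 28 but a missing Def, so it is excluded.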
