My data is structured as follows:
library(data.table)
DT <- data.table(Id=c(1,2,3,4,5), Va1=c(3,13,NA,NA,NA), Va2=c(4,40,NA,NA,4), Va3=c(5,34,NA,7,84),
                 Va4=c(2,23,NA,63,9), Vb1=c(8,45,1,7,0), Vb2=c(0,35,0,7,6), Vb3=c(63,0,0,0,5), Vc1=c(2,5,0,0,4))
> DT
Id Va1 Va2 Va3 Va4 Vb1 Vb2 Vb3 Vc1
1: 1 3 4 5 2 8 0 63 2
2: 2 13 40 34 23 45 35 0 5
3: 3 NA NA NA NA 1 0 0 0
4: 4 NA NA 7 63 7 7 0 0
5: 5 NA 4 84 9 0 6 5 4
Additionally, I have a reference list that indexes all the column groups:
reference <- list(g.1=c(2,3,4,5), g.2=c(6,7,8), g.3=c(9))
Columns 2,3,4,5 (variables Va1, Va2, Va3, and Va4) belong to one group of variables. Columns 6,7,8 (variables Vb1, Vb2, Vb3) belong to a second group. Column 9 (variable Vc1) belongs to a third group.
What I need to do is calculate the difference between consecutive columns within each column group.
That is, I need the difference between Va2 and Va1, between Va3 and Va2, and so on, but not between Vb1 and Va4.
The output should look like:
Id Va1 Va2 Va3 Va4 Vb1 Vb2 Vb3 Vc1 D[Va1:Va2] D[Va2:Va3] D[Va3:Va4] D[Vb1:Vb2] D[Vb2:Vb3]
1: 1 3 4 5 2 8 0 63 2 1 1 -3 -8 63
2: 2 13 40 34 23 45 35 0 5 27 -6 -11 -10 -35
3: 3 NA NA NA NA 1 0 0 0 NA NA NA -1 0
4: 4 NA NA 7 63 7 7 0 0 NA NA 56 0 -7
5: 5 NA 4 84 9 0 6 5 4 NA 80 -75 6 -1
Currently I am using the following loop:
for(i in 1:(length(reference)-1)){   # the last group has a single column, so it is skipped
  tmp <- as.list(reference[[i]])
  tmp <- tmp[-length(tmp)]
  # pair each column index with the one following it: c(x+1, x)
  tmp <- mapply(c, lapply(tmp, FUN = function(x) x+1), tmp, SIMPLIFY = FALSE)
  for(j in 1:length(tmp)){
    DT <- cbind(DT, delta = DT[, tmp[[j]][1], with = FALSE] - DT[, tmp[[j]][2], with = FALSE])
  }
}
but my real data.table has 300-500 columns and over 1,000,000 rows.
How can I make this more efficient?
I think your loop is fine, except you should use := instead of cbind to add columns:
ref <- lapply(reference, function(x) names(DT)[x])  # convert column indices to names
for (g in ref){
  if (length(g) == 1) next          # singleton groups have no consecutive pairs
  gx = tail(g, -1)                  # "to" columns: 2nd through last in the group
  gy = head(g, -1)                  # "from" columns: 1st through second-to-last
  gn = paste0("D[", gy, ":", gx, "]")
  DT[, (gn) := mapply(function(x, y) .SD[[x]] - .SD[[y]], gx, gy, SIMPLIFY = FALSE)]
}
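If you'd rather avoid the explicit loop, a hedged alternative (a sketch using the same DT and reference, not benchmarked against the loop) is to build all the from/to column pairs up front and do a single := assignment:
# build one (from, to) row per consecutive pair in each group
pairs <- do.call(rbind, lapply(reference, function(idx) {
  nm <- names(DT)[idx]
  if (length(nm) < 2) return(NULL)   # singleton groups contribute no pairs
  data.frame(from = head(nm, -1), to = tail(nm, -1), stringsAsFactors = FALSE)
}))
new_cols <- paste0("D[", pairs$from, ":", pairs$to, "]")
# one assignment creates every delta column at once
DT[, (new_cols) := Map(function(f, t) DT[[t]] - DT[[f]], pairs$from, pairs$to)]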
data <- structure(list(
x = c(1, 2, 1, 2, 2, 1, 3, 3, 1),
y = c(20, 30, 40, 10, 15, 34, 57, 72, 12)),
class = "data.frame",
row.names = c(NA,-9L))
Hi guys, I want to create a new variable from the above data.frame in RStudio, but it doesn't work. What I want to do is the equivalent of this Stata command, but in R:
gen var = y*3600 if x == 1
So I ran this R command, but it didn't work:
data$var[data$x == 1] <- data$y*3600
The new variable should look like this:
x  y   var
1  20  72000
2  30  NA
1  40  144000
2  10  NA
2  15  NA
1  34  122400
3  57  NA
3  72  NA
1  12  43200
I appreciate any help, and thanks in advance.
data$var <- ifelse(data$x == 1, data$y * 3600, NA)
x y var
1 1 20 72000
2 2 30 NA
3 1 40 144000
4 2 10 NA
5 2 15 NA
6 1 34 122400
7 3 57 NA
8 3 72 NA
9 1 12 43200
We can use replace, like below:
> transform(
+ data,
+ var = replace(y * 3600, x != 1, NA)
+ )
x y var
1 1 20 72000
2 2 30 NA
3 1 40 144000
4 2 10 NA
5 2 15 NA
6 1 34 122400
7 3 57 NA
8 3 72 NA
9 1 12 43200
Another option:
data$var <- data$y * 3600
data$var[data$x != 1] <- NA
data
#-------
> data
x y var
1 1 20 72000
2 2 30 NA
3 1 40 144000
4 2 10 NA
5 2 15 NA
6 1 34 122400
7 3 57 NA
8 3 72 NA
9 1 12 43200
In data.table:
library(data.table)
setDT(data)                  # converts to a data.table by reference; no reassignment needed
data[x == 1, var := y*3600]
You need to subset the data on both sides of the assignment.
data$var <- NA
data$var[data$x == 1] <- data$y[data$x == 1] * 3600
data
# x y var
#1 1 20 72000
#2 2 30 NA
#3 1 40 144000
#4 2 10 NA
#5 2 15 NA
#6 1 34 122400
#7 3 57 NA
#8 3 72 NA
#9 1 12 43200
Another option is to use case_when in dplyr.
library(dplyr)
data <- data %>% mutate(var = case_when(x == 1 ~ y * 3600))
By default, if no condition is satisfied, it returns NA.
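If you want the non-matching rows to receive something other than NA, a catch-all TRUE condition supplies a default; a small variant of the above (the 0 default is just an example):
data <- data %>% mutate(var = case_when(x == 1 ~ y * 3600,
                                        TRUE ~ 0))    # rows with x != 1 get 0 instead of NA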
I have strings of basketball player stats, like in the example below:
stats <- c("40pt 2rb 1as 2st 2to 4trey 11-20fg 14-14ft",
"7pt 5rb 1as 2st 1bl 3to 3-5fg 1-4ft",
"0pt 1rb 1as 0-2fg")
Ideally I would like to transform these strings into a tabular format, with one column per stat; a sketch of the target follows the key below.
This is the key for each column:
pt=points
rb=rebounds
as=assists
st=steals
bl=blocks
to=turnovers
trey=3 pointers made
fg=field goals made-attempted
ft=free throws made-attempted
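For reference, the target table would then look something like this (reconstructed from the key above; stats absent from a line are shown as NA, and fg/ft are split into made/attempted):
pt rb as st bl to trey fg_m fg_a ft_m ft_a
40  2  1  2 NA  2    4   11   20   14   14
 7  5  1  2  1  3   NA    3    5    1    4
 0  1  1 NA NA NA   NA    0    2   NA   NA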
We split each string at the boundary between a digit and a letter (or at whitespace) to create a list ('lst'), loop through the list turning each element into a one-row data.frame whose column names come from the alternating split values, bind the rows with rbindlist, split the made-attempted columns containing '-' into separate columns with cSplit, and convert the NA values to 0.
library(data.table)
library(splitstackshape)
# split at each digit-to-letter boundary and at whitespace
lst <- strsplit(stats, "(?<=[0-9])(?=[a-z])|\\s+", perl = TRUE)
# odd elements are the values, even elements the stat names
lst1 <- lapply(lst, function(x)
  as.data.frame.list(setNames(x[c(TRUE, FALSE)], x[c(FALSE, TRUE)])))
# bind the rows, then split the made-attempted columns on '-'
res <- cSplit(rbindlist(lst1, fill = TRUE), c('fg', 'ft'), '-')
# coerce every column to numeric and replace NA with 0, by reference
for(nm in seq_along(res)){
  set(res, i = NULL, j = nm, value = as.numeric(as.character(res[[nm]])))
  set(res, i = which(is.na(res[[nm]])), j = nm, value = 0)
}
res
res
# pt rb as st to trey bl fg_1 fg_2 ft_1 ft_2
#1: 40 2 1 2 2 4 0 11 20 14 14
#2: 7 5 1 2 3 0 1 3 5 1 4
#3: 0 1 1 0 0 0 0 0 2 0 0
Use dcast from the reshape2 package:
library(reshape2)
# rewrite "11-20fg" as "11fg_m 20fg_a" (made/attempted)
m = gsub("(\\d+)-(\\d+)(\\w+)", "\\1\\3_m \\2\\3_a", stats)
# one "value name" pair per line, with the digits separated from the label
n = gsub("(\\d+)(\\S*)", "\\1 \\2", gsub("\\s", "\n", m))
# long format: one row per stat, tagged with its player (group) number
o = cbind(read.table(text = n), group = rep(1:length(n), lengths(strsplit(n, "\n"))))
dcast(o, group ~ V2, value.var = "V1")
group as bl fg_a fg_m ft_a ft_m pt rb st to trey
1 1 1 NA 20 11 14 14 40 2 2 2 4
2 2 1 1 5 3 4 1 7 5 2 3 NA
3 3 1 NA 2 0 NA NA 0 1 NA NA NA
Using base R
> m=gsub("(\\d+)-(\\d+)(\\w+)","\\1\\3_m \\2\\3_a",stats)
> n=gsub("(\\d+)(\\S*)","\\1 \\2",gsub("\\s","\n",m))
> o=lapply(n,function(x)rev(read.table(text=x)))
> p=Reduce(function(x,y)merge(x,y,by="V2",all=T),o)
> read.table(text=do.call(paste,data.frame(t(p))),h=T)
as fg_a fg_m ft_a ft_m pt rb st to trey bl
1 1 20 11 14 14 40 2 2 2 4 NA
2 1 5 3 4 1 7 5 2 3 NA 1
3 1 2 0 NA NA 0 1 NA NA NA NA
I have a file with 3 columns. The 1st column is ID; the 2nd and 3rd are values for 2 conditions. The condition columns contain both negative and positive values. I would like to make 2 separate files: the 1st for the negative values and the 2nd for the positive values. Do you know how to do that in R?
Something like this?
set.seed(1)
df1 <- data.frame(id=1:5,cond1 = sample(-100:100,5), cond2 = sample(-100:100,5))
df_neg <- df_pos <- df1
df_pos[,2:3][df1[,2:3]<0] <- NA # or 0
df_neg[,2:3][df1[,2:3]>0] <- NA # or 0
# > df1
# id cond1 cond2
# 1 1 -47 80
# 2 2 -26 88
# 3 3 13 31
# 4 4 79 24
# 5 5 -61 -88
# > df_pos
# id cond1 cond2
# 1 1 NA 80
# 2 2 NA 88
# 3 3 13 31
# 4 4 79 24
# 5 5 NA NA
# > df_neg
# id cond1 cond2
# 1 1 -47 NA
# 2 2 -26 NA
# 3 3 NA NA
# 4 4 NA NA
# 5 5 -61 -88
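Since the question asks for two separate files, you can then write each data frame out; a minimal sketch (the file names are placeholders):
write.csv(df_pos, "positive_values.csv", row.names = FALSE)
write.csv(df_neg, "negative_values.csv", row.names = FALSE)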
These are examples of two data frames I am working on. 'claims' has fewer rows than 'lastaction'.
My attempts give the following errors.
newtable <- merge(claims, lastaction, by = "X", all = TRUE)
Error in `[<-.data.frame`(`*tmp*`, value, value = NA) :
  new columns would leave holes after existing columns
newtable <- merge(claims, lastaction, by.x = claims$X, by.y = lastaction$X, all = TRUE)
Error in fix.by(by.x, x) : 'by' must match numbers of columns
The merge function works fine for me. Since both data frames share the column name X, you can pass by = "X" to merge on it. (Note that by.x and by.y expect column names, not the column vectors themselves, which is why your second attempt failed.)
claims = data.frame(X = c(10,24,30,35,64,104),
TransactionDateTime = c('JUL-15','APR-17','SEP-15','JUL-15','APR-16','SEP-15'))
claims
# X TransactionDateTime
# 1 10 JUL-15
# 2 24 APR-17
# 3 30 SEP-15
# 4 35 JUL-15
# 5 64 APR-16
# 6 104 SEP-15
lastaction = data.frame(X = c(10,24,30,35,40,57), lastvalue = c(6,1,4,6,6,1),
Approvalmonth = c('15-OCT','17-JAN','16-MAR','15-OCT','15-SEP','17-JUN'),
lastvalue = c(0,1,0,0,0,1))
lastaction
# X lastvalue Approvalmonth lastvalue.1
# 1 10 6 15-OCT 0
# 2 24 1 17-JAN 1
# 3 30 4 16-MAR 0
# 4 35 6 15-OCT 0
# 5 40 6 15-SEP 0
# 6 57 1 17-JUN 1
merge(claims, lastaction, by = "X", all = TRUE)
# X TransactionDateTime lastvalue Approvalmonth lastvalue.1
# 1 10 JUL-15 6 15-OCT 0
# 2 24 APR-17 1 17-JAN 1
# 3 30 SEP-15 4 16-MAR 0
# 4 35 JUL-15 6 15-OCT 0
# 5 40 <NA> 6 15-SEP 0
# 6 57 <NA> 1 17-JUN 1
# 7 64 APR-16 NA <NA> NA
# 8 104 SEP-15 NA <NA> NA
dplyr's full_join works as well:
dplyr::full_join(claims, lastaction, by = 'X')
X TransactionDateTime lastvalue Approvalmonth lastvalue.1
1 10 JUL-15 6 15-OCT 0
2 24 APR-17 1 17-JAN 1
3 30 SEP-15 4 16-MAR 0
4 35 JUL-15 6 15-OCT 0
5 64 APR-16 NA <NA> NA
6 104 SEP-15 NA <NA> NA
7 40 <NA> 6 15-SEP 0
8 57 <NA> 1 17-JUN 1
I always use "with" instead of "within" in the context of my research, but I originally thought they were the same. Just now I mistyped "within" for "with", and the results returned are quite different. I am wondering why?
I am using the baseball data in the plyr package, so I first load the library by
require(plyr)
Then, I want to select all rows with the id "ansonca01". At first, as I said, I used "within", and ran the function as follows:
within(baseball, baseball[id=="ansonca01", ])
I got very strange results which basically include everything:
id year stint team lg g ab r h X2b X3b hr rbi sb cs bb so ibb hbp sh sf gidp
4 ansonca01 1871 1 RC1 25 120 29 39 11 3 0 16 6 2 2 1 NA NA NA NA NA
44 forceda01 1871 1 WS3 32 162 45 45 9 4 0 29 8 0 4 0 NA NA NA NA NA
68 mathebo01 1871 1 FW1 19 89 15 24 3 1 0 10 2 1 2 0 NA NA NA NA NA
99 startjo01 1871 1 NY2 33 161 35 58 5 1 1 34 4 2 3 0 NA NA NA NA NA
102 suttoez01 1871 1 CL1 29 128 35 45 3 7 3 23 3 1 1 0 NA NA NA NA NA
106 whitede01 1871 1 CL1 29 146 40 47 6 5 1 21 2 2 4 1 NA NA NA NA NA
113 yorkto01 1871 1 TRO 29 145 36 37 5 7 2 23 2 2 9 1 NA NA NA NA NA
.........
Then I use "with" instead of "within",
with(baseball, baseball[id=="ansonca01",])
and got the results that I expected
id year stint team lg g ab r h X2b X3b hr rbi sb cs bb so ibb hbp sh sf gidp
4 ansonca01 1871 1 RC1 25 120 29 39 11 3 0 16 6 2 2 1 NA NA NA NA NA
121 ansonca01 1872 1 PH1 46 217 60 90 10 7 0 50 6 6 16 3 NA NA NA NA NA
276 ansonca01 1873 1 PH1 52 254 53 101 9 2 0 36 0 2 5 1 NA NA NA NA NA
398 ansonca01 1874 1 PH1 55 259 51 87 8 3 0 37 6 0 4 1 NA NA NA NA NA
525 ansonca01 1875 1 PH1 69 326 84 106 15 3 0 58 11 6 4 2 NA NA NA NA NA
I checked the documentation of with and within by typing help(with) in the R environment, and got the following:
with is a generic function that evaluates expr in a local environment constructed from data. The environment has the caller's environment as its parent. This is useful for simplifying calls to modeling functions. (Note: if data is already an environment then this is used with its existing parent.)
Note that assignments within expr take place in the constructed environment and not in the user's workspace.
within is similar, except that it examines the environment after the evaluation of expr and makes the corresponding modifications to data (this may fail in the data frame case if objects are created which cannot be stored in a data frame), and returns it. within can be used as an alternative to transform.
From this explanation of the differences, I don't get why I obtained different results from such a simple operation. Does anyone have ideas?
I find simple examples often work to highlight the difference. Something like:
df <- data.frame(a=1:5,b=2:6)
df
a b
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6
with(df, {c <- a + b; df;} )
a b
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6
within(df, {c <- a + b; df;} )
# equivalent to: within(df, c <- a + b)
# i've just made the return of df explicit
# for comparison's sake
a b c
1 1 2 3
2 2 3 5
3 3 4 7
4 4 5 9
5 5 6 11
The documentation is quite clear about the semantics and return values (and nicely matches the everyday meanings of the words “with” and “within”):
Value:
For ‘with’, the value of the evaluated ‘expr’. For ‘within’, the
modified object.
Since your code doesn't modify anything inside baseball, the unmodified baseball is returned. with, on the other hand, doesn't return the object; it returns the value of expr.
Here’s an example where the expression modifies the object:
> head(within(cars, speed[dist < 20] <- 1))
speed dist
1 1 2
2 1 10
3 1 4
4 7 22
5 1 16
6 1 10
As above, with returns the value of the last evaluated expression. It is handy for one-liners such as:
with(cars, summary(lm(speed ~ dist)))
but since assignments made inside the expression are discarded, it is not suitable for multi-step modifications of the data.
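A small illustration of that pitfall (m is a throwaway name, assuming it isn't already defined in your workspace):
with(cars, { m <- max(speed); m * 2 })  # returns 50, the value of the last expression
exists("m")                             # FALSE: m lived only in the temporary environment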
I often find within useful for manipulating a data.frame or list (or data.table) as I find the syntax easy to read.
I feel that the documentation could be improved by adding examples of use in this regard, e.g.:
df1 <- data.frame(a=1:3,
b=4:6,
c=letters[1:3])
## library("data.table")
## df1 <- as.data.table(df1)
df1 <- within(df1, {
a <- 10:12
b[1:2] <- letters[25:26]
c <- a
})
df1
giving (shown here after the optional data.table conversion)
a b c
1: 10 y 10
2: 11 z 11
3: 12 6 12
and
df1 <- as.list(df1)
df1 <- within(df1, {
a <- 20:23
b[1:2] <- letters[25:26]
c <- paste0(a, b)
})
df1
giving
$a
[1] 20 21 22 23
$b
[1] "y" "z" "6"
$c
[1] "20y" "21z" "226" "23y"
Note also that methods("within") lists only these methods:
within.data.frame
within.list
(and within.data.table if the package is loaded).
Other packages may define additional methods.
Perhaps unexpectedly for some, with and within are generally not appropriate choices when manipulating variables within defined environments...
To address the comment: there is no within.environment method. Using with requires you to have the function you're calling inside the environment, which somewhat defeats the purpose for me, e.g.:
df1 <- as.environment(df1)
## with(df1, ls()) ## Error
assign("ls", ls, envir=df1)
with(df1, ls())
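For completeness, the direct way to inspect an environment is to pass it to ls explicitly, which avoids having to copy ls into it:
ls(envir = df1)  # lists the names bound in df1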