My data is in the following form:
y<-data.frame(atp=c(1,0,1,0,0,1),
ssmin=c(2,NA,3,NA,NA,1),
Day_1=round(runif(6,5,11),0),
Day_2=round(runif(6,88,110),0),
Day_3=round(runif(6,90,211),0))
I need to create a new column which picks the value from column 3, 4 or 5 depending on the value in column 2(ssmin).
The output would be like this:
FDRT<-c(89,NA,175,NA,NA,7)
I am trying out the following command but this does not help
y$new<- y[which(y$atp==1),na.omit(2+y$ssmin)]
Can any one help me how to write a code for it as my data is in large chunks and i cannot write value individually.
I think this might be what you're trying to do, but I'm not certain:
set.seed(1)
y<-data.frame(atp=c(1,0,1,0,0,1),
ssmin=c(2,NA,3,NA,NA,1),
Day_1=round(runif(6,5,11),0),
Day_2=round(runif(6,88,110),0),
Day_3=round(runif(6,90,211),0))
y
# atp ssmin Day_1 Day_2 Day_3
# 1 1 2 7 109 173
# 2 0 NA 7 103 136
# 3 1 3 8 102 183
# 4 0 NA 10 89 150
# 5 0 NA 6 93 177
# 6 1 1 10 92 210
x <- vapply(y$ssmin, function(x) unique(grep(x, names(y), value = TRUE)),
vector("character", 1L))
Z <- vector(length = length(x))
for (i in sequence(nrow(y))) {
Z[i] <- if (is.na(x[i])) NA else y[i, x[i]]
}
Z
# [1] 109 NA 183 NA NA 10
If I understand your question correctly, the last line you give almost solves your question. You just have to modify it slightly to get only the diagonal elements of the right hand side and only assign it to the applicable elements of the vector new. Here's the modified code.
y[which(y$atp==1), "new"] <- diag(as.matrix(y[which(y$atp==1),na.omit(2+y$ssmin)]))
not very elegant but short
y
atp ssmin Day_1 Day_2 Day_3
1 1 2 5 97 123
2 0 NA 8 108 165
3 1 3 10 109 190
4 0 NA 9 110 177
5 0 NA 10 91 182
6 1 1 7 94 141
> apply(y,1, function(r)r[r[2]+2])
[1] 97 NA 190 NA NA 7
for a more robust maintainable solution you probably want to hardcode the column names using ddply or somesuch.
Related
In the example below, the event start is defined as when the prior value of "values" is 90 or more and the current value is below 90. The event end is when the current value is below 90 and the next value is 90 or above.
sequential_index <- seq(1,10)
values <- c(91,90,89,89,90,90,89,88,90,91)
df <- data.frame(sequential_index, values)
Looking at df in the example above, the first event occurs for observations 3-4 and the second event occurs for observations 7-8. I am trying, to no avail, to add an "events" column to the above data frame that looks something like this:
sequential_index values events
1 1 91 NA
2 2 90 NA
3 3 89 1
4 4 89 1
5 5 90 NA
6 6 90 NA
7 7 89 2
8 8 88 2
9 9 90 NA
10 10 91 NA
My dataset is rather large and I'm trying to avoid for loops.
Thanks in advance,
-jt
I have this solution using dplyr.
library(dplyr)
df %>%
# Define the start of events (putting 1 at the start of events)
mutate(events = case_when(lag(values)>=90 & values<90 ~ 1, TRUE ~ 0)) %>%
# Extend the events using cumsum()
mutate(events = case_when(values<90 ~ cumsum(events)))
Output :
sequential_index values events
1 1 91 NA
2 2 90 NA
3 3 89 1
4 4 89 1
5 5 90 NA
6 6 90 NA
7 7 89 2
8 8 88 2
9 9 90 NA
10 10 91 NA
One option with base R would be rle
df$events <- inverse.rle(within.list(rle(df$values < 90),
values[values] <- seq_along(values[values])
))
df$events[df$events == 0] <- NA
df$events
#[1] NA NA 1 1 NA NA 2 2 NA NA
Or in a compact way with data.table
library(data.table)
setDT(df)[, events := as.integer(factor(rleid(events < 90)[events < 90]))]
Let's say I have a df that looks like this
ID X_Value
1 40
2 13
3 75
4 83
5 64
6 43
7 74
8 45
9 54
10 84
So what I would like to do, is to do a rolling function that if in the actual and last 4 rows, there are 2 or more values that are higher than X (let's say 70 for this example) then return 1, else 0.
So the output would be something like the following:
ID X_Value Next_4_2
1 40 0
2 13 0
3 75 0
4 83 1
5 64 1
6 43 1
7 24 1
8 45 0
9 74 0
10 84 1
I think this would be possible with a rolling function, but I have tried and not sure how to do it. Thank you in advance
Given your expected output, I suppose you meant "in the actual and previous 3 rows". Then using some rolling function indeed does the job:
library(zoo)
thr1 <- 70
thr2 <- 2
last <- 3 + 1
df$Next_4_2 <- 1 * (rollsum(df$X_Value > thr1, last, align = "right", fill = 0) >= thr2)
df
# ID X_Value Next_4_2
# 1 1 40 0
# 2 2 13 0
# 3 3 75 0
# 4 4 83 1
# 5 5 64 1
# 6 6 43 1
# 7 7 74 1
# 8 8 45 0
# 9 9 54 0
# 10 10 84 1
The indexing using max(1,i-3) is perhaps the only part of the code worth remembering. I might help in subsequent construction when a for-loop was really needed.
dat$X_Next_4_2 <- integer( length(dat$X_Value) )
dat$ X_Next_4_2[1]=0
for (i in 2:length(dat$X_Value) ){
dat$ X_Next_4_2[i]=
( sum(dat$X_Value[i: (max(0, i-4) )] >=70) >=2 )}
(Not very pretty and clearly inferior to the rollsum answer already posted.)
I am trying to shorten a chunk of code to make it faster and easier to modify. This is a short example of my data.
order obs year var1 var2 var3
1 3 1 1 32 588 NA
2 4 1 2 33 689 2385
3 5 1 3 NA 678 2369
4 33 3 1 10 214 1274
5 34 3 2 10 237 1345
6 35 3 3 10 242 1393
7 78 6 1 5 62 NA
8 79 6 2 5 75 296
9 80 6 3 5 76 500
10 93 7 1 NA NA NA
11 94 7 2 4 86 247
12 95 7 3 3 54 207
Basically, what I want is R to find any possible and unique combination of two values (observations) in column "obs", within the same year, to create a new matrix or DF with observations being the aggregation of the originals. Order is not important, so 1+6 = 6+1. For instance, having 150 observations, I will expect 11,175 feasible combinations (each year).
I sort of got what I want with basic coding but, as you will see, is way too long (I have built this way 66 different new data sets so it does not really make a sense) and I am wondering how to shorten it. I did some trials (plyr,...) with no real success. Here what I did:
# For the 1st year, groups of 2 obs
newmatrix <- data.frame(t(combn(unique(data$obs[data$year==1]), 2)))
colnames(newmatrix) <- c("obs1", "obs2")
newmatrix$name <- do.call(paste, c(newmatrix[c("obs1", "obs2")], sep = "_"))
# and the aggregation of var. using indexes, which I will skip here to save your time :)
To ilustrate, here the result, considering above sample, of what I would get for the 1st year. NA is because I only computed those where the 2 values were valid. And only for variables 1 and 3. More, I did the sum but it could be any other possible Function:
order obs1 obs2 year var1 var3
1 1 1 3 1_3 42 NA
2 2 1 6 1_6 37 NA
3 3 1 7 1_7 NA NA
4 4 3 6 3_6 15 NA
5 5 3 7 3_7 NA NA
6 6 6 7 6_7 NA NA
As for the 2 first lines in the 3rd year, same type of matrix:
order obs1 obs2 year var1 var3
1 1 1 3 1_3 NA 3762
2 2 1 6 1_6 NA 2868
.......... etc ............
I hope I explained myself. Thank you in advance for your hints on how to do this more efficient.
I would use split-apply-combine to split by year, find all the combinations, and then combine back together:
do.call(rbind, lapply(split(data, data$year), function(x) {
p <- combn(nrow(x), 2)
data.frame(order=paste(x$order[p[1,]], x$order[p[2,]], sep="_"),
obs1=x$obs[p[1,]],
obs2=x$obs[p[2,]],
year=x$year[1],
var1=x$var1[p[1,]] + x$var1[p[2,]],
var2=x$var2[p[1,]] + x$var2[p[2,]],
var3=x$var3[p[1,]] + x$var3[p[2,]])
}))
# order obs1 obs2 year var1 var2 var3
# 1.1 3_33 1 3 1 42 802 NA
# 1.2 3_78 1 6 1 37 650 NA
# 1.3 3_93 1 7 1 NA NA NA
# 1.4 33_78 3 6 1 15 276 NA
# 1.5 33_93 3 7 1 NA NA NA
# 1.6 78_93 6 7 1 NA NA NA
# 2.1 4_34 1 3 2 43 926 3730
# 2.2 4_79 1 6 2 38 764 2681
# 2.3 4_94 1 7 2 37 775 2632
# 2.4 34_79 3 6 2 15 312 1641
# 2.5 34_94 3 7 2 14 323 1592
# 2.6 79_94 6 7 2 9 161 543
# 3.1 5_35 1 3 3 NA 920 3762
# 3.2 5_80 1 6 3 NA 754 2869
# 3.3 5_95 1 7 3 NA 732 2576
# 3.4 35_80 3 6 3 15 318 1893
# 3.5 35_95 3 7 3 13 296 1600
# 3.6 80_95 6 7 3 8 130 707
This enables you to be very flexible in how you combine data pairs of observations within a year --- x[p[1,],] represents the year-specific data for the first element in each pair and x[p[2,],] represents the year-specific data for the second element in each pair. You can return a year-specific data frame with any combination of data for the pairs, and the year-specific data frames are combined into a single final data frame with do.call and rbind.
I am trying to rank multiple numeric variables ( around 700+ variables) in the data and am not sure exactly how to do this as I am still pretty new to using R.
I do not want to overwrite the ranked values in the same variable and hence need to create a new rank variable for each of these numeric variables.
From reading the posts, I believe assign and transform function along with rank maybe able to solve this. I tried implementing as below ( sample data and code) and am struggling to get it to work.
The output dataset in addition to variables xcount, xvisit, ysales need to be populated
With variables xcount_rank, xvisit_rank, ysales_rank containing the ranked values.
input <- read.table(header=F, text="101 2 5 6
102 3 4 7
103 9 12 15")
colnames(input) <- c("id","xcount","xvisit","ysales")
input1 <- input[,2:4] #need to rank the numeric variables besides id
for (i in 1:3)
{
transform(input1,
assign(paste(input1[,i],"rank",sep="_")) =
FUN = rank(-input1[,i], ties.method = "first"))
}
input[paste(names(input)[2:4], "rank", sep = "_")] <-
lapply(input[2:4], cut, breaks = 10)
The problem with this approach is that it's creating the rank values as (101, 230] , (230, 450] etc whereas I would like to see the values in the rank variable to be populated as 1, 2 etc up to 10 categories as per the splits I did. Is there any way to achieve this? input[5:7] <- lapply(input[5:7], rank, ties.method = "first")
The approach I tried from the solutions provided below is:
input <- read.table(header=F, text="101 20 5 6
102 2 4 7
103 9 12 15
104 100 8 7
105 450 12 65
109 25 28 145
112 854 56 93")
colnames(input) <- c("id","xcount","xvisit","ysales")
input[paste(names(input)[2:4], "rank", sep = "_")] <-
lapply(input[2:4], cut, breaks = 3)
Current output I get is:
id xcount xvisit ysales xcount_rank xvisit_rank ysales_rank
1 101 20 5 6 (1.15,286] (3.95,21.3] (5.86,52.3]
2 102 2 4 7 (1.15,286] (3.95,21.3] (5.86,52.3]
3 103 9 12 15 (1.15,286] (3.95,21.3] (5.86,52.3]
4 104 100 8 7 (1.15,286] (3.95,21.3] (5.86,52.3]
5 105 450 12 65 (286,570] (3.95,21.3] (52.3,98.7]
6 109 25 28 145 (1.15,286] (21.3,38.7] (98.7,145]
7 112 854 56 93 (570,855] (38.7,56.1] (52.3,98.7]
Desired output:
id xcount xvisit ysales xcount_rank xvisit_rank ysales_rank
1 101 20 5 6 1 1 1
2 102 2 4 7 1 1 1
3 103 9 12 15 1 1 1
4 104 100 8 7 1 1 1
5 105 450 12 65 2 1 2
6 109 25 28 145 1 2 3
Would like to see the records in the group they would fall under if I try to rank the interval values.
Using dplyr
library(dplyr)
nm1 <- paste("rank", names(input)[2:4], sep="_")
input[nm1] <- mutate_each(input[2:4],funs(rank(., ties.method="first")))
input
# id xcount xvisit ysales rank_xcount rank_xvisit rank_ysales
#1 101 2 5 6 1 2 1
#2 102 3 4 7 2 1 2
#3 103 9 12 15 3 3 3
Update
Based on the new input and using cut
input[nm1] <- mutate_each(input[2:4], funs(cut(., breaks=3, labels=FALSE)))
input
# id xcount xvisit ysales rank_xcount rank_xvisit rank_ysales
#1 101 20 5 6 1 1 1
#2 102 2 4 7 1 1 1
#3 103 9 12 15 1 1 1
#4 104 100 8 7 1 1 1
#5 105 450 12 65 2 1 2
#6 109 25 28 145 1 2 3
#7 112 854 56 93 3 3 2
I always use "with" instead of "within" within the context of my research, but I originally thought they were the same. Just now I mistype "with" for "within" and the results returned are quite different. I am wondering why?
I am using the baseball data in the plyr package, so I first load the library by
require(plyr)
Then, I want to select all rows with an id "ansonca01". At first, as I said, I used "within", and run the function as follows:
within(baseball, baseball[id=="ansonca01", ])
I got very strange results which basically includes everything:
id year stint team lg g ab r h X2b X3b hr rbi sb cs bb so ibb hbp sh sf gidp
4 ansonca01 1871 1 RC1 25 120 29 39 11 3 0 16 6 2 2 1 NA NA NA NA NA
44 forceda01 1871 1 WS3 32 162 45 45 9 4 0 29 8 0 4 0 NA NA NA NA NA
68 mathebo01 1871 1 FW1 19 89 15 24 3 1 0 10 2 1 2 0 NA NA NA NA NA
99 startjo01 1871 1 NY2 33 161 35 58 5 1 1 34 4 2 3 0 NA NA NA NA NA
102 suttoez01 1871 1 CL1 29 128 35 45 3 7 3 23 3 1 1 0 NA NA NA NA NA
106 whitede01 1871 1 CL1 29 146 40 47 6 5 1 21 2 2 4 1 NA NA NA NA NA
113 yorkto01 1871 1 TRO 29 145 36 37 5 7 2 23 2 2 9 1 NA NA NA NA NA
.........
Then I use "with" instead of "within",
with(baseball, baseball[id=="ansonca01",])
and got the results that I expected
id year stint team lg g ab r h X2b X3b hr rbi sb cs bb so ibb hbp sh sf gidp
4 ansonca01 1871 1 RC1 25 120 29 39 11 3 0 16 6 2 2 1 NA NA NA NA NA
121 ansonca01 1872 1 PH1 46 217 60 90 10 7 0 50 6 6 16 3 NA NA NA NA NA
276 ansonca01 1873 1 PH1 52 254 53 101 9 2 0 36 0 2 5 1 NA NA NA NA NA
398 ansonca01 1874 1 PH1 55 259 51 87 8 3 0 37 6 0 4 1 NA NA NA NA NA
525 ansonca01 1875 1 PH1 69 326 84 106 15 3 0 58 11 6 4 2 NA NA NA NA NA
I checked the documentation of with and within by typing help(with) in R environment, and got the following:
with is a generic function that evaluates expr in a local environment constructed from data. The environment has the caller's environment as its parent. This is useful for simplifying calls to modeling functions. (Note: if data is already an environment then this is used with its existing parent.)
Note that assignments within expr take place in the constructed environment and not in the user's workspace.
within is similar, except that it examines the environment after the evaluation of expr and makes the corresponding modifications to data (this may fail in the data frame case if objects are created which cannot be stored in a data frame), and returns it. within can be used as an alternative to transform.
From this explanation of the differences, I don't get why I obtained different results with such a simple operation. Anyone has ideas?
I find simple examples often work to highlight the difference. Something like:
df <- data.frame(a=1:5,b=2:6)
df
a b
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6
with(df, {c <- a + b; df;} )
a b
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6
within(df, {c <- a + b; df;} )
# equivalent to: within(df, c <- a + b)
# i've just made the return of df explicit
# for comparison's sake
a b c
1 1 2 3
2 2 3 5
3 3 4 7
4 4 5 9
5 5 6 11
The documentation is quite clear about the semantics and return values (and nicely matches the everyday meanings of the words “with” and “within”):
Value:
For ‘with’, the value of the evaluated ‘expr’. For ‘within’, the
modified object.
Since your code doesn’t modify anything inside baseball, the unmodified baseball is returned. with on the other hand doesn’t return the object, it returns expr.
Here’s an example where the expression modifies the object:
> head(within(cars, speed[dist < 20] <- 1))
speed dist
1 1 2
2 1 10
3 1 4
4 7 22
5 1 16
6 1 10
As above, with returns the value of the last evaluated expression. It is handy for one-liners such as:
with(cars, summary(lm (speed ~ dist)))
but is not suitable for sending multiple expressions.
I often find within useful for manipulating a data.frame or list (or data.table) as I find the syntax easy to read.
I feel that the documentation could be improved by adding examples of use in this regard, e.g.:
df1 <- data.frame(a=1:3,
b=4:6,
c=letters[1:3])
## library("data.table")
## df1 <- as.data.table(df1)
df1 <- within(df1, {
a <- 10:12
b[1:2] <- letters[25:26]
c <- a
})
df1
giving
a b c
1: 10 y 10
2: 11 z 11
3: 12 6 12
and
df1 <- as.list(df1)
df1 <- within(df1, {
a <- 20:23
b[1:2] <- letters[25:26]
c <- paste0(a, b)
})
df1
giving
$a
[1] 20 21 22 23
$b
[1] "y" "z" "6"
$c
[1] "20y" "21z" "226" "23y"
Note also that methods("within") gives only these object types, being:
within.data.frame
within.list
(and within.data.table if the package is loaded).
Other packages may define additional methods.
Perhaps unexpectedly for some, with and within are generally not appropriate choices when manipulating variables within defined environments...
To address the comment - there is no within.environment method. Using with requires you to have the function you're calling within the environment, which somewhat defeats the purpose for me e.g.
df1 <- as.environment(df1)
## with(df1, ls()) ## Error
assign("ls", ls, envir=df1)
with(df1, ls())