Edit column under condition - R

I have a table:
id <- c(1,1,2,2,2,2,2,3,3,4,4,5,5,5)
dist <- c(0,1,1,0,2,15,0,4,4,0,5,5,16,2)
data <- data.frame(id, dist )
I would like to edit the column id when dist is greater than a certain value (let's say 10). I am looking to add +1 when data$dist > 10.
The final output would be:
data$id_new <- c(1,1,2,2,2,3,3,4,4,5,5,6,7,7)
Is it possible to do something with an if inside a loop? I tried something with a loop but I have not been successful.

Maybe using cumsum:
data$new_id <- data$id + cumsum(data$dist > 10)
Explanation:
cumsum(data$dist > 10) returns the running count of the elements in data$dist that are greater than 10 (the logical values TRUE/FALSE are treated as 1/0). You can see how this works by taking the expression apart in R and looking at each piece, as below.
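For example, taking it apart on this data:
data$dist > 10
# [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
cumsum(data$dist > 10)
# [1] 0 0 0 0 0 1 1 1 1 1 1 1 2 2
Adding this running count to the original id gives the desired new id.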

We can use duplicated with >
with(data, cumsum(dist > 10| !duplicated(id)))
#[1] 1 1 2 2 2 3 3 4 4 5 5 6 7 7
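To see why this works: !duplicated(id) is TRUE at the first row of each id, so the cumulative sum starts a new value at every id change as well as after every dist > 10:
!duplicated(data$id)
# [1]  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE
with(data, dist > 10 | !duplicated(id))
# [1]  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE
Unlike the cumsum answer above, this does not require id to be consecutive integers, only that equal ids are adjacent.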


Summarize results on a vector of a different length than the original - pivot table in R

I would like to use the vector:
time.int<-c(1,2,3,4,5) #vector to be use as a "guide"
and the data frame:
time<-c(1,1,1,1,5,5,5)
value<-c("s","s","s","t","d","d","d")
dat1<- as.data.frame(cbind(time,value))
to create the following vector, which I can then combine with the first vector "time.int" in a second data frame.
freq<-c(4,0,0,0,3) #wished result
This vector is the count of the events that fall in each time interval: there are four 1s in "time", so the first value gets a 4, and so on.
Ideally I would like to generalize it so that I can choose the interval width, for example summing the events in "time" for every 3 values of time.int.
EDIT for generalization
time.int<-c(1,2,3,4,5,6)
time<-c(1,1,1,2,5,5,5,6)
value<-c("s","s","s","t", "t","d","d","d")
dat1<- data.frame(time,value)
let's say I want it every 2 seconds (every 2 time.int)
freq<-c(4,0,4) #wished result
or every 3
freq<-c(4,4) #wished result
I know how to do this in Excel with a pivot table.
Sorry if this is a duplicate; I could not find a fitting question on this site, and I am not even sure how to phrase the search.
The following will produce the vector freq:
freq <- sapply(time.int, function(x) sum(x == time))
freq
[1] 4 0 0 0 3
BTW, don't use the construct as.data.frame(cbind(.)): cbind builds a matrix first, coercing everything to character. Use instead
dat1 <- data.frame(time, value)
In order to generalize the code above to segments of time.int of any length, I believe the following function will do it. Note that since you've changed the data the output for n == 1 is not the same as above.
fun <- function(x, y, n){
  # split the indices of x into consecutive chunks of length n
  # (a trailing remainder shorter than n is dropped)
  inx <- lapply(seq_len(length(x) %/% n), function(m) seq_len(n) + n*(m - 1))
  # count how many elements of y fall into each chunk of x
  sapply(inx, function(i) sum(y %in% x[i]))
}
freq1 <- fun(time.int, time, 1)
freq1
[1] 3 1 0 0 3 1
freq2 <- fun(time.int, time, 2)
freq2
[1] 4 0 4
freq3 <- fun(time.int, time, 3)
freq3
[1] 4 4
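For comparison, here is a sketch of the same generalization using cut, assuming time.int is a run of consecutive integers whose length is a multiple of n (which fun also effectively assumes):
fun2 <- function(x, y, n){
  # break points every n values of x; right = FALSE gives [a, a+n) intervals
  brks <- seq(min(x), max(x) + 1, by = n)
  as.vector(table(cut(y, breaks = brks, right = FALSE)))
}
fun2(time.int, time, 2)
# [1] 4 0 4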
We can use the table function to count the events and merge to create a data frame summarizing the information. event_dat is the final output.
# Create example data
time.int <- c(1,2,3,4,5)
time <- c(1,1,1,1,5,5,5)
# Count the event using table and convert to a data frame
event <- as.data.frame(table(time))
# Convert the time.int to a data frame
time_dat <- data.frame(time = time.int)
# Merge the data
event_dat <- merge(time_dat, event, by = "time", all = TRUE)
# Replace NA with 0
event_dat[is.na(event_dat)] <- 0
# See the result
event_dat
time Freq
1 1 4
2 2 0
3 3 0
4 4 0
5 5 3
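A more compact variant of the same idea, for what it's worth: fixing the factor levels to time.int makes table pad the missing times with zero counts, so no merge is needed.
event_dat <- as.data.frame(table(factor(time, levels = time.int)))
names(event_dat) <- c("time", "Freq")
event_dat
#   time Freq
# 1    1    4
# 2    2    0
# 3    3    0
# 4    4    0
# 5    5    3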

R : group by with custom functions

I have managed to aggregate data successfully using the following pattern:
newdf <- setDT(df)[, list(X=sum(x),Y=max(y)), by=Z]
However, the moment I try to do anything more complicated, although the code runs, it no longer aggregates by Z: it creates a data frame with the same number of observations as the original df, so I know that no grouping is actually occurring.
The custom function I would like to apply finds the n-quantile of the current group's values and then does some other stuff with it. I saw .SDcols used in another SO answer and tried something like:
customfunc <- function(dt){
  q = unname(quantile(dt$column,0.25))
  n = nrow(dt[dt$column <= q])
  return(n/dt$someOtherColumn)
}
# fails to group anything!!! also rather slow...
newdf <- setDT(df)[, customfunc(.SD), by=Z, .SDcols=c("column", "someOtherColumn")]
Can someone please help me figure out what is wrong with the way I'm trying to use group by and custom functions? Thank you very much.
Literal example as requested:
> df <- data.frame(Z=c("abc","abc","def","abc"), column=c(1,2,3,4), someOtherColumn=c(5,6,7,8))
> df
Z column someOtherColumn
1 abc 1 5
2 abc 2 6
3 def 3 7
4 abc 4 8
> newdf <- setDT(df)[, customfunc(.SD), by=Z, .SDcols=c("column", "someOtherColumn")]
> newdf
Z V1
1: abc 0.2000000
2: abc 0.1666667
3: abc 0.1250000
4: def 0.1428571
>
As you can see, the result is not grouped. There should be just two rows, one for "abc" and another for "def", since I am trying to group by Z.
As guided by eddi's point above, the basic problem is thinking that your custom function is called once per row and that dt$column will mysteriously give you the current value at the current row. Instead, the function is called once per group, and dt$column is that group's entire column (a vector): the function is passed each group's chunk of the data table (.SD), not row-wise bits of data. Because the function returns a vector with one value per row, data.table expands the result back out, which is why the output looks ungrouped.
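A quick way to see this for yourself (printing inside j is for illustration only):
setDT(df)[, {print(.SD); NULL}, by=Z]
# prints the 3-row subset for "abc", then the 1-row subset for "def"
Each printed block is the whole chunk of data your function receives for one group.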
So, replacing the value in the return statement with something that represents a single value works. Example:
customfunc <- function(dt){
  # dt is the group's subset (.SD); take its first quartile
  q = unname(quantile(dt$column, 0.25))
  # count the rows at or below that quartile
  n = nrow(dt[dt$column <= q])
  # divide by the group size -- a single value, so one row per group
  return(n/length(dt$someOtherColumn))
}
> df <- data.frame(Z=c("abc","abc","def","abc"), column=c(1,2,3,4), someOtherColumn=c(5,6,7,8))
> df
Z column someOtherColumn
1 abc 1 5
2 abc 2 6
3 def 3 7
4 abc 4 8
> newdf <- setDT(df)[, customfunc(.SD), by=Z, .SDcols=c("column", "someOtherColumn")]
> newdf
Z V1
1: abc 0.3333333
2: def 1.0000000
Now the data is aggregated correctly.
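For what it's worth, the same result can be computed inline without a helper function; a sketch using .N (the number of rows in the current group):
newdf <- setDT(df)[, .(V1 = sum(column <= quantile(column, 0.25)) / .N), by=Z]
Because j returns a single value per group, the result has exactly one row per Z.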

Using sum(x:y) to create a new variable/vector from existing values in R

I am working in R with a data frame d:
ID <- c("A","A","A","B","B")
eventcounter <- c(1,2,3,1,2)
numberofevents <- c(3,3,3,2,2)
d <- data.frame(ID, eventcounter, numberofevents)
> d
ID eventcounter numberofevents
1 A 1 3
2 A 2 3
3 A 3 3
4 B 1 2
5 B 2 2
where numberofevents is the highest value in the eventcounter for each ID.
Currently, I am trying to create an additional vector z <- c(6,6,6,3,3).
If numberofevents == 3, it is supposed to calculate sum(1:3), which equals 3 + 2 + 1 = 6.
If numberofevents == 2, it is supposed to calculate sum(1:2), which equals 2 + 1 = 3.
Working with a large set of data, I thought it might be convenient to create this additional vector
by using the sum function in R, d$z <- sum(1:d$numberofevents), i.e.
sum(1:3) # for the rows 1-3
and
sum(1:2) # for the rows 4-5.
However, I always get this warning:
numerical expression has 5 elements: only the first used
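The warning appears because the colon operator only uses the first element of each endpoint, so the whole column silently collapses to its first value:
1:d$numberofevents
# [1] 1 2 3
# Warning message: numerical expression has 5 elements: only the first used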
You can try ave
d$z <- with(d, ave(eventcounter, ID, FUN=sum))
Or using data.table
library(data.table)
setDT(d)[,z:=sum(eventcounter), ID][]
Try using the apply family of functions in R, such as sapply or lapply:
sapply(numberofevents, function(x) sum(1:x))
It works for me.
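Since sum(1:n) is just the triangular number n*(n + 1)/2, a fully vectorized alternative avoids the apply loop entirely:
d$z <- d$numberofevents * (d$numberofevents + 1) / 2
d$z
# [1] 6 6 6 3 3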

Subsetting data conditional on first instance in R

data:
row A B
1 1 1
2 1 1
3 1 2
4 1 3
5 1 1
6 1 2
7 1 3
Hi all! What I'm trying to do (example above) is to sum the values in column A, but only when column B == 1 (starting with a simple subset line, below).
sum(data$A[data$B==1])
However, I only want to do this for the first time that condition occurs, i.e. until the values switch. If the condition re-occurs later in the column (row 5 in the example), I'm not interested in it!
I'd really appreciate your help with this (I suspect simple) problem!
Using data.table for syntactic elegance, you can use rle to get this done:
library(data.table)
DT <- data.table(data)
DT[, B1 := {
  # runs of B == 1 vs. everything else
  bb <- rle(B == 1)
  r <- bb$values
  # number the TRUE runs 1, 2, ... so each run of 1s gets its own label
  r[r] <- seq_len(sum(r))
  bb$values <- r
  # expand back to one label per row: here B1 = 1,1,0,0,2,0,0
  inverse.rle(bb)
}]
DT[B1 == 1, sum(A)]
# [1] 2
Here's a rather elaborate way of doing that:
data$counter = cumsum(data$B == 1)
sum(data$A[(data$counter >= 1:nrow(data) - sum(data$counter == 0)) &
(data$counter != 0)])
Another way:
idx <- which(data$B == 1)
sum(data$A[idx[idx == (seq_along(idx) + idx[1] - 1)]])
# [1] 2
# or alternatively
sum(data$A[idx[idx == seq(idx[1], length.out = length(idx))]])
# [1] 2
The idea: first get all indices where B == 1; here that's c(1,2,5). Starting from the first index, you want only the indices that are consecutive, i.e. c(1,2,3,...). So generate that many consecutive numbers starting at idx[1] and compare them elementwise with idx. They stop matching at the first gap, and because idx is strictly increasing, every element after the first mismatch also mismatches. So only the leading run of consecutive indices (which is what you want) survives the comparison.
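For comparison, a compact cumsum sketch also works, assuming (as here) that the data start with B == 1: the rows of interest are exactly those before the first occurrence of B != 1.
sum(data$A[cumsum(data$B != 1) == 0])
# [1] 2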

How to assign number of repeats to dataframe based on elements of an identifying vector in R?

I have a dataframe with individuals assigned a text id that concatenates a place-name with a personal id (see data, below). Ultimately, I need to transform the data set from "long" to "wide" (e.g., using "reshape") so that each individual comprises one row only. In order to do that, I need to assign a "time" variable that reshape can use to identify time-varying covariates, etc. I have (probably bad) code to do this for individuals that repeat up to two times, but I need to be able to identify up to 18 repeated occurrences. The code below works fine if I remove the line preceded by the hash, but then it only identifies up to two repeats. If I leave that line in (which would seem necessary for individuals repeated more than twice), R chokes, giving the following error (because at i = 2 the index i - 2 is 0, and data$uid[0] is a zero-length vector):
Error in if (data$uid[i] == data$uid[i - 2]) { :
argument is of length zero
Can anyone help with this? Thanks in advance!
place <- rep("ny",10)
pid <- c(1,1,2,2,2,3,4,4,5,5)
uid<- paste(place,pid,sep="")
time <- rep(0,10)
data <- cbind(uid,time)
data <- as.data.frame(data)
data$time <- as.numeric(data$time)
#bad code
data$time[1] <- 1 # set the first row by hand so that the loop doesn't index row 0
for (i in 2:NROW(data)){
  data$time[i] <- 1 # set first occurrence to 1
  if (data$uid[i] == data$uid[i-1]) {data$time[i] <- 2} # set second occurrence to 2, etc.
  #if (data$uid[i] == data$uid[i-2]) {data$time[i] <- 3}
}
It's unclear what you are trying to do, but I think you're saying that you need to create a time index for each row within each unique uid. Is that right?
If so, give this a whirl
library(plyr)
ddply(data, "uid", transform, time = seq_along(uid))
Will give you something like:
uid time
1 ny1 1
2 ny1 2
3 ny2 1
4 ny2 2
5 ny2 3
....
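A similar sketch with dplyr, in case you prefer that family (assumes the dplyr package is installed):
library(dplyr)
data %>% group_by(uid) %>% mutate(time = row_number())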
Is this what you have in mind?
> d <- data.frame(uid = paste("ny",c(1,2,1,2,2,3,4,4,5,5),sep=""))
> out <- do.call(rbind, lapply(split(d, d$uid), function(x) {x$time <- 1:nrow(x); x}))
> rownames(out) <- NULL
> out
uid time
1 ny1 1
2 ny1 2
3 ny2 1
4 ny2 2
5 ny2 3
6 ny3 1
7 ny4 1
8 ny4 2
9 ny5 1
10 ny5 2
Using your data frame setup:
place <- rep("ny",10)
pid <- c(1,1,2,2,2,3,4,4,5,5)
uid<- paste(place,pid,sep="")
time <- rep(0,10)
data <- cbind(uid,time)
data <- as.data.frame(data)
You can use:
data$time <- sequence(table(data$uid))
data
To get:
> data
uid time
1 ny1 1
2 ny1 2
3 ny2 1
4 ny2 2
5 ny2 3
6 ny3 1
7 ny4 1
8 ny4 2
9 ny5 1
10 ny5 2
NOTE: Your data.frame MUST be sorted by uid first for this to work.
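For large data, a data.table sketch of the same idea is fast and does not require prior sorting (assumes the data.table package):
library(data.table)
setDT(data)[, time := seq_len(.N), by = uid]  # .N is the size of each uid group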
After trying the above solutions on large data sets, I decided to write my own loop for this. It was very time-consuming and still required the data to be broken into 50k-element vectors, but it did work in the end:
data$repeats <- 1 # initialize: every row is at least a first occurrence
system.time(
  for (i in 2:length(data$uid)) {
    if (data$uid[i] == data$uid[i-1]) data$repeats[i] <- data$repeats[i-1] + 1
    if ((i %% 1000) == 0) { # helps to keep track of how far the loop has gotten
      print(i)
    }
  }
)
Thanks to all for your help.
