R: group by with custom functions

I have managed to aggregate data successfully using the following pattern:
newdf <- setDT(df)[, list(X=sum(x),Y=max(y)), by=Z]
However, the moment I try anything more complicated, the code runs but no longer aggregates by Z: it creates a data frame with the same number of observations as the original df, so I know that no grouping is actually occurring.
The custom function I would like to apply finds the n-th quantile of the current group's values and then does some other stuff with it. I saw .SDcols used in another SO answer and tried something like:
customfunc <- function(dt){
  q = unname(quantile(dt$column, 0.25))
  n = nrow(dt[dt$column <= q])
  return(n/dt$someOtherColumn)
}
#fails to group anything!!! also rather slow...
newdf <- setDT(df)[, customfunc(.SD), by=Z, .SDcols=c("column", "someOtherColumn")]
Can someone please help me figure out what is wrong with the way I'm trying to use group by and custom functions? Thank you very much.
Literal example as requested:
> df <- data.frame(Z=c("abc","abc","def","abc"), column=c(1,2,3,4), someOtherColumn=c(5,6,7,8))
> df
Z column someOtherColumn
1 abc 1 5
2 abc 2 6
3 def 3 7
4 abc 4 8
> newdf <- setDT(df)[, customfunc(.SD), by=Z, .SDcols=c("column", "someOtherColumn")]
> newdf
Z V1
1: abc 0.2000000
2: abc 0.1666667
3: abc 0.1250000
4: def 0.1428571
>
As you can see, df is not grouped. There should be just two rows: one for "abc" and another for "def", since I am trying to group by Z.

As guided by eddi's point above, the basic problem is thinking that your custom function is called inside a loop and that dt$column will mysteriously give you the "current value at the current row". Instead it gives you the entire column (a vector). The function is passed the whole subset of the data table for each group (.SD), not row-wise bits of data; and because the return value n/dt$someOtherColumn is a vector with one element per row of the group, data.table expands the result into one output row per element, which is why the output looks ungrouped.
So, replacing the value in the return statement with something that evaluates to a single value per group works. Example:
customfunc <- function(dt){
  # dt is the group's subset (.SD), so dt$column is the whole vector for the group
  q = unname(quantile(dt$column, 0.25))
  n = nrow(dt[dt$column <= q])
  # return a single value so data.table produces one row per group
  return(n/length(dt$someOtherColumn))
}
> df <- data.frame(Z=c("abc","abc","def","abc"), column=c(1,2,3,4), someOtherColumn=c(5,6,7,8))
> df
Z column someOtherColumn
1 abc 1 5
2 abc 2 6
3 def 3 7
4 abc 4 8
> newdf <- setDT(df)[, customfunc(.SD), by=Z, .SDcols=c("column", "someOtherColumn")]
> newdf
Z V1
1: abc 0.3333333
2: def 1.0000000
Now the data is aggregated correctly.
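For reference, the same per-group statistic can also be computed directly in j without a custom function, which avoids the .SD overhead and is usually faster (a sketch; frac is just an assumed name for the result column):
newdf <- setDT(df)[, .(frac = sum(column <= quantile(column, 0.25)) / .N), by = Z]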

Related

Replace nth consecutive occurrence of a value

I want to replace the nth consecutive occurrence of a particular code in my data frame. This should be a relatively easy task but I can't think of a solution.
Given a data frame
df <- data.frame(Values = c(1,4,5,6,3,3,2),
                 Code = c(1,1,2,2,2,1,1))
I want a result
df_result <- data.frame(Values = c(1,4,5,6,3,3,2),
                        Code = c(1,0,2,2,2,1,0))
The data frame is time-ordered, so I need to keep the same order after replacing the values. I guess that the nth() or duplicated() functions could be useful here, but I'm not sure how to use them. What I'm missing is a function that counts the number of consecutive occurrences of a given value. Once I have it, I could use it to replace the nth occurrence.
This question had some ideas that I explored but still didn't solve my problem.
EDIT:
After an answer by @Gregor I wrote the following function, which solves the problem:
library(data.table)
library(dplyr)
replace_nth <- function(x, nth, code) {
  y <- data.table(x)
  # label each run of consecutive identical codes
  y[, code_rleid := rleid(Code)]
  # position of each row within its run
  y[, seq := seq_along(Code), by = code_rleid]
  # zero out the nth occurrence of the target code within each run
  y[seq == nth & Code == code, Code := 0]
  drop.cols <- c("code_rleid", "seq")
  y %>% select(-one_of(drop.cols)) %>% data.frame() %>% return()
}
To get the solution, simply run replace_nth(df, 2, 1)
Using data.table:
library(data.table)
setDT(df)
df[, code_rleid := rleid(df$Code)]
df[, seq := seq_along(Code), by = code_rleid]
df[seq == 2 & Code == 1, Code := 0]
df
# Values Code code_rleid seq
# 1: 1 1 1 1
# 2: 4 0 1 2
# 3: 5 2 2 1
# 4: 6 2 2 2
# 5: 3 2 2 3
# 6: 3 1 3 1
# 7: 2 0 3 2
You could combine some of these (and drop the extra columns after). I'll leave it clear and let you make modifications as you like.
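For instance, the steps can be chained into one call by using rleid() directly in by and dropping the helper column afterwards (a sketch of the same logic as above):
setDT(df)[, seq := seq_along(Code), by = rleid(Code)][seq == 2 & Code == 1, Code := 0][, seq := NULL]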

Edit column under condition

I have a table:
id <- c(1,1,2,2,2,2,2,3,3,4,4,5,5,5)
dist <- c(0,1,1,0,2,15,0,4,4,0,5,5,16,2)
data <- data.frame(id, dist )
I would like to edit the column id when dist is greater than a certain value (let's say 10). I am looking to add +1 to id whenever data$dist > 10.
The final output would be:
data$id_new <- c(1,1,2,2,2,3,3,4,4,5,5,6,7,7)
Is it possible to do this with an if statement inside a loop? I tried something with a loop but was not successful.
Maybe using cumsum:
data$new_id <- data$id + cumsum(data$dist > 10)
Explanation:
cumsum(data$dist > 10) returns the cumulative sum of a logical vector that is TRUE wherever data$dist is greater than 10 (each TRUE counts as 1). You can see how this works by taking the expression apart in R and evaluating each piece.
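For example, on the sample data:
data$dist > 10
# FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
cumsum(data$dist > 10)
# 0 0 0 0 0 1 1 1 1 1 1 1 2 2
Adding this running count to data$id gives the desired 1 1 2 2 2 3 3 4 4 5 5 6 7 7.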
We can also build the new id from scratch, using duplicated together with >:
with(data, cumsum(dist > 10| !duplicated(id)))
#[1] 1 1 2 2 2 3 3 4 4 5 5 6 7 7
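Here !duplicated(id) is TRUE at the first row of each original id, so the expression inside cumsum is TRUE wherever a new group should start, either because the id changes or because dist exceeds 10:
with(data, dist > 10 | !duplicated(id))
# TRUE FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE TRUE FALSE
cumsum then turns those group starts into a running counter.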

Using sum(x:y) to create a new variable/vector from existing values in R

I am working in R with a data frame d:
ID <- c("A","A","A","B","B")
eventcounter <- c(1,2,3,1,2)
numberofevents <- c(3,3,3,2,2)
d <- data.frame(ID, eventcounter, numberofevents)
> d
ID eventcounter numberofevents
1 A 1 3
2 A 2 3
3 A 3 3
4 B 1 2
5 B 2 2
where numberofevents is the highest value in the eventcounter for each ID.
Currently, I am trying to create an additional vector z <- c(6,6,6,3,3).
If the numberofevents == 3, it is supposed to calculate sum(1:3), equally to 3 + 2 + 1 = 6.
If the numberofevents == 2, it is supposed to calculate sum(1:2) equally to 2 + 1 = 3.
Working with a large set of data, I thought it might be convenient to create this additional vector by using the sum function in R, d$z <- sum(1:d$numberofevents), i.e.
sum(1:3) # for the rows 1-3
and
sum(1:2) # for the rows 4-5.
However, I always get this warning:
Numerical expression has x elements: only the first is used.
You can try ave
d$z <- with(d, ave(eventcounter, ID, FUN=sum))
Or using data.table
library(data.table)
setDT(d)[,z:=sum(eventcounter), ID][]
Try using one of the apply family of functions in R, e.g. sapply:
sapply(numberofevents, function(x) sum(1:x))
It works for me.
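Note that sum(1:n) is just the n-th triangular number, so the same column can also be computed fully vectorized, with no apply call at all, using the identity n*(n+1)/2:
d$z <- d$numberofevents * (d$numberofevents + 1) / 2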

Referencing previous rows of a data.frame to compute a new column in R

I am trying to calculate changes in weight between visits to chicks at different nests. This requires R to look up the nest code in the current row, find the previous time that nest was visited, and subtract the weight at the previous visit from the current visit. For the first visit to each nest, I would like to output the current weight (i.e. as though the weight at the previous, non-existent visit was zero).
My data is of the form:
Nest <- c("a","b","c","d","e","c","b","c")
Weight <- c(2,4,3,3,2,6,8,10)
df <- data.frame(Nest, Weight)
So the desired output here would be:
Change <- c(2,4,3,3,2,3,4,4)
I have achieved the desired output once, by subsetting to a single nest and using a for loop:
tmp <- subset(df, Nest == "a")
tmp$change <- tmp$Weight
for(x in 2:(length(tmp$Nest))){
  tmp$change[x] <- tmp$Weight[(x)] - tmp$Weight[(x-1)]
}
but when I try to fit this into ddply
df2 <- ddply(df, "Nest", function(f) {
  f$change <- f$Weight
  for(x in 2:(length(f$Nest))){
    f$change <- f$Weight[(x)] - f$Weight[(x-1)]
  }
})
the output gives a blank data.frame (0 obs. of 0 variables).
Am I approaching this the right way but getting the code wrong? Or is there a better way to do it?
Thanks in advance!
Try this:
library(dplyr)
df %>% group_by(Nest) %>% mutate(Change = c(Weight[1], diff(Weight)))
or with just the base of R
transform(df, Change = ave(Weight, Nest, FUN = function(x) c(x[1], diff(x))))
Here is a data.table solution. With large data sets, this is likely to be faster.
library(data.table)
setDT(df)[,Change:=c(Weight[1],diff(Weight)),by=Nest]
df
# Nest Weight Change
# 1: a 2 2
# 2: b 4 4
# 3: c 3 3
# 4: d 3 3
# 5: e 2 2
# 6: c 6 3
# 7: b 8 4
# 8: c 10 4
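An equivalent formulation uses shift() (available in data.table 1.9.6+), filling the non-existent previous visit with 0 exactly as the question asks:
setDT(df)[, Change := Weight - shift(Weight, fill = 0), by = Nest]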

Number of Unique Obs by Variable in a Data Table

I have read in a large data file into R using the following command
data <- as.data.set(spss.system.file(paste(path, file, sep = '/')))
The data set contains columns which should not belong, and contain only blanks. This issue has to do with R creating new variables based on the variable labels attached to the SPSS file (Source).
Unfortunately, I have not been able to determine the options necessary to resolve the problem. I have tried all of foreign::read.spss, memisc::spss.system.file, and Hmisc::spss.get, with no luck.
Instead, I would like to read in the entire data set (with ghost columns) and remove unnecessary variables manually. Since the ghost columns contain only blank spaces, I would like to remove any variables from my data.table where the number of unique observations is equal to one.
My data are large, so they are stored in data.table format. I would like to determine an easy way to check the number of unique observations in each column, and drop columns which contain only one unique observation.
require(data.table)
### Create a data.table
dt <- data.table(a = 1:10,
                 b = letters[1:10],
                 c = rep(1, times = 10))
### Create a comparable data.frame
df <- data.frame(dt)
### Expected result
unique(dt$a)
### Expected result
length(unique(dt$a))
However, I wish to calculate the number of obs for a large data file, so referencing each column by name is not desired. I am not a fan of eval(parse()).
### I want to determine the number of unique obs in
# each variable, for a large list of vars
lapply(names(df), function(x) {
  length(unique(df[, x]))
})
### Unexpected result
length(unique(dt[, 'a', with = F])) # Returns 1
It seems to me the problem is that
dt[, 'a', with = F]
returns an object of class "data.table". It makes sense that the length of this object is 1, since it is a data.table containing 1 variable. We know that data.frames are really just lists of variables, and so in this case the length of the list is just 1.
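To see the difference on the dt defined above:
length(dt[, 'a', with = FALSE])  # 1: a one-column data.table is a list of length 1
length(dt[['a']])                # 10: the underlying vector itself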
Here's pseudo code for how I would remedy the solution, using the data.frame way:
for (x in names(data)) {
  unique.obs <- length(unique(data[, x]))
  if (unique.obs == 1) {
    data[, x] <- NULL
  }
}
Any insight as to how I may more efficiently ask for the number of unique observations by column in a data.table would be much appreciated. Alternatively, if you can recommend how to drop observations if there is only one unique observation within a data.table would be even better.
Update: uniqueN
As of version 1.9.6, there is a built in (optimized) version of this solution, the uniqueN function. Now this is as simple as:
dt[ , lapply(.SD, uniqueN)]
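Combining uniqueN with the drop step, the single-valued columns can be filtered out in one line (a sketch; it keeps only columns with more than one unique value):
dt[, .SD, .SDcols = names(dt)[sapply(dt, uniqueN) > 1]]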
If you want to find the number of unique values in each column, something like
dt[, lapply(.SD, function(x) length(unique(x)))]
## a b c
## 1: 10 10 1
To get your function to work you need to use with=FALSE within [.data.table, or simply use [[ instead (read fortune(312) as well...)
lapply(names(df), function(x) nrow(unique(dt[, x, with = FALSE])))
or
lapply(names(df), function(x) length(unique(dt[[x]])))
will work
In one step
dt[, names(dt) := lapply(.SD, function(x) if(length(unique(x)) == 1) {return(NULL)} else {return(x)})]
# or to avoid calling `.SD`
dt[, Filter(names(dt), f = function(x) length(unique(dt[[x]]))==1) := NULL]
The approaches in the other answers are good. Another way to add to the mix, just for fun :
for (i in names(DT)) if (length(unique(DT[[i]]))==1) DT[,(i):=NULL]
or if there may be duplicate column names :
for (i in ncol(DT):1) if (length(unique(DT[[i]]))==1) DT[,(i):=NULL]
NB: (i) on the LHS of := is a trick to use the value of i rather than a column named "i".
Here is a solution to your core problem (I hope I got it right).
require(data.table)
### Create a data.table
dt <- data.table(a = 1:10,
                 b = letters[1:10],
                 d1 = "",
                 c = rep(1, times = 10),
                 d2 = "")
dt
     a b d1 c d2
 1:  1 a     1
 2:  2 b     1
 3:  3 c     1
 4:  4 d     1
 5:  5 e     1
 6:  6 f     1
 7:  7 g     1
 8:  8 h     1
 9:  9 i     1
10: 10 j     1
First, I introduce two columns d1 and d2 that have no values whatsoever. Those you want to delete, right? If so, I just identify those columns and select all other columns in the dt.
only_space <- function(x) {
  length(unique(x)) == 1 && x[1] == ""
}
bolCols <- apply(dt, 2, only_space)
dt[, (1:ncol(dt))[!bolCols], with = FALSE]
Somehow, I have the feeling that you could further simplify it...
Output:
a b c
1: 1 a 1
2: 2 b 1
3: 3 c 1
4: 4 d 1
5: 5 e 1
6: 6 f 1
7: 7 g 1
8: 8 h 1
9: 9 i 1
10: 10 j 1
There is an easy way to do that using the dplyr library and its select function, as follows:
library(dplyr)
newdata <- select(old_data, first_variable, second_variable)
Note that you can choose as many variables as you like; the result keeps only those columns.
