data.table: applying a function by row in R

I have a function that takes arguments, and I want to apply it over each row of a data.table.
My data looks as follows:
Row Temp Humidity Elevation
1 10 0.5 1000
2 25 1.5 2000
3 28 2.0 1500
and I have a function
myfunc <- function(x, n_features = 3) {
  # Here x represents each row of a data.table.
  # Feature names are important to me, as my actual function operates on feature names.
  return(x[, Temp] + x[, Humidity] + (x[, Elevation] * n_features))
}
What I want my output to look like is
Row Temp Humidity Elevation myfuncout
1 10 0.5 1000 3010.5
2 25 1.5 2000 6026.5
3 28 2.0 1500 4530
I have tried df[, myfuncout := myfunc(x, n_features=3), by=.I] but this didn't work.
Also, I am not sure whether I have to use .SD here to make this work.
Any input on how I can achieve this?
Thanks!

If it is a data.table, we can use
myfunc <- function(dt, n_features = 3) {
  dt[, out := (Temp + Humidity) + (Elevation * n_features)]
  dt
}
myfunc(df, 3)
-output
df
# Row Temp Humidity Elevation out
#1: 1 10 0.5 1000 3010.5
#2: 2 25 1.5 2000 6026.5
#3: 3 28 2.0 1500 4530.0
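If you really do need to call a row-wise function that relies on column names (as in the original myfunc), one hedged option is to group by row index so that .SD is a one-row data.table inside the function. A minimal sketch, assuming df is already a data.table with the columns shown above:

library(data.table)
# group by row index; .SD is then a one-row data.table, so column names still work
myfunc_row <- function(x, n_features = 3) {
  x[, Temp] + x[, Humidity] + x[, Elevation] * n_features
}
df[, myfuncout := myfunc_row(.SD), by = seq_len(nrow(df))]

This is typically slower than the vectorised version above, since the function is called once per row.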

Related

R - Create random subsamples of size n for multiple sample groups

I have a large data set of samples that belong to different groups and differ in the area covered. The structure of the data set is simplified below. I now would like to create pooled samples (Subgroups) for each Group where the area covered by each Subgroup equates to a specified area (e.g. 20). Samples should be allocated randomly and without replacement to each Subgroup and the number of the Subgroup should be listed in a new column at the end of the data frame.
SampleID Group Area Subgroup
1 A 1.5 1
2 A 3.8 2
3 A 6 4
4 A 1.9 1
5 A 1.5 3
6 A 4.1 1
7 A 3.7 1
8 A 4.5 3
...
300 B 1.2 1
301 B 3.8 1
302 B 4.1 4
303 B 2.6 3
304 B 3.1 5
305 B 3.5 3
306 B 2.1 2
...
2000 S 2.7 5
...
I am currently using the ‘cumsum’ command to create the Subgroups, using the code below.
dat <- read.table("Pooling_Test.txt", header = TRUE, sep = "\t")
dat$CumArea <- cumsum(dat$Area)
dat$Diff_CumArea <- c(0, head(cumsum(dat$Area), -1))
dat$Sample_Int_1 <- "0"
dat$Sample_End <- "0"
current.sum <- 0
for (c in 1:nrow(dat)) {
  current.sum <- current.sum + dat[c, "Area"]
  dat[c, "Diff_CumArea"] <- current.sum
  if (current.sum >= 20) {
    dat[c, "Sample_Int_1"] <- "1"
    dat[c, "Sample_End"] <- "End"
    current.sum <- 0
    dat$Sample_Int_2 <- cumsum(dat$Sample_Int_1) + 1
    dat$Sample_Final <- dat$Sample_Int_2
    for (d in 1:nrow(dat)) {
      if (dat$Sample_End[d] == 'End')
        dat$Subgroup[d] <- dat$Sample_Int_2[d] - 1
      else 0
    }
  }
}
write.csv(dat, file = 'Pooling_Test_Output.csv', row.names = FALSE)
The resultant data frame shows what I want (see below). However, there are a couple of steps I would like to improve. First, I have problems including a command for choosing samples randomly from each Group, so I currently randomise the order of samples before loading the data frame into R. Secondly, in the output table the Subgroups are numbered consecutively, but I would like to start the Subgroup numbering with 1 for each new Group. Has anybody any advice on how to achieve this?
SampleID Group CumArea Subgroups
1 A 1.5 1
77 A 4.6 1
6 A 9.3 1
43 A 16.4 1
17 A 19.5 1
67 A 2.1 2
4 A 4.3 2
32 A 8.9 2
...
300 B 4.5 10
257 B 6.8 10
397 B 10.6 10
344 B 14.5 10
367 B 16.7 10
303 B 20.1 10
306 B 1.5 11
...
A few functions in the dplyr package make this fairly straightforward. You can use slice to randomize the data, group_by to perform computations at the group level, and mutate to create new variables. If you chain the functions together with the %>% operator, I believe the solution would look something like this, assuming that you want groups that add up to 20.
install.packages("dplyr") # if you haven't used dplyr before
library(dplyr)

dat %>%
  group_by(Group) %>%
  slice(sample(1:n())) %>%
  mutate(CumArea = cumsum(Area), SubGroup = ceiling(CumArea / 20))
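A quick way to sanity-check the pipeline is to run it on a small made-up data set (the names and values below are illustrative, not the asker's real file):

library(dplyr)
set.seed(1)
dat <- data.frame(SampleID = 1:12,
                  Group = rep(c("A", "B"), each = 6),
                  Area = runif(12, min = 1, max = 8))

dat %>%
  group_by(Group) %>%
  slice(sample(1:n())) %>%                  # shuffle rows within each Group
  mutate(CumArea = cumsum(Area),
         SubGroup = ceiling(CumArea / 20))  # new Subgroup each time 20 is passed

Because cumsum() and the division by 20 are computed per group, the SubGroup numbering restarts at 1 for each Group, which addresses the second point in the question.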

how to subtract rows in a data frame in R depending on difference between values?

I have a data frame that is in the following format
df <- data.frame(name=LETTERS[1:5], location=c(2000,2021,4532,1931,3457),
value=c(1,0,1,1,0))
name location value
A 2000 1
B 2021 0
C 4532 1
D 1931 1
E 3457 0
There are approximately a million rows in the data frame. How would I create a new data frame that has the distance between every pair of locations that are within 1000 of each other, and that also checks whether the values are one for both locations?
For the above dataset, the new data frame would only have three rows, with values of 21 (absolute value of 2000 - 2021), 69 (absolute value of 2000 - 1931), and 90 (absolute value of 2021 - 1931), because those are the only differences that are less than 1000. It would also have a column of 0 (because the A and B values are not both 1), 1 (because the A and D values are both 1), and 0 (because the B and D values are not both 1). So it would look like:
21 0
69 1
90 0
I've tried using loops but since there are so many rows, it's inefficient. Is there some built in function that I should use to do this faster?
Thanks in advance.
library(sqldf)
sqldf("
select a.location
, b.location
, a.location - b.location as locdiff
, a.value*b.value as value
from df a
inner join df b
on a.location - b.location between 1 and 1000
")
This gives
a.location b.location locdiff value
1 2000 1931 69 1
2 2021 2000 21 0
3 2021 1931 90 0
Or with data.table. This is just @MKR's solution, but adding a column to avoid a large join result. Not sure if it's possible to achieve this without creating a new column.
setDT(df)
df[, loc2 := location - 1000]

df[df,
   .(locdiff   = i.location - x.location,
     locationA = i.location,
     locationB = x.location,
     value     = x.value * i.value),
   on = .(location >= loc2, location < location),
   nomatch = 0]
gives
locdiff locationA locationB value
1: 69 2000 1931 1
2: 90 2021 1931 0
3: 21 2021 2000 0
I agree with @Gregor's comment that sqldf is the better option in the above scenario, in the sense that it avoids a cartesian join of a million records.
But I tried to optimise a data.table-based solution by first joining on x.location > i.location and then filtering on diff <= 1000.
df <- data.frame(name = LETTERS[1:5],
                 location = c(2000, 2021, 4532, 1931, 3457),
                 value = c(1, 0, 1, 1, 0))

library(data.table)
setDT(df)

df[df, .(name, diff = x.location - i.location, value = x.value * i.value),
   on = .(location > location), nomatch = 0][diff <= 1000]
# name diff value
# 1: B 21 0
# 2: A 69 1
# 3: B 90 0

When a variable switches from 1 to 2, delete some data from the other variables and average what's left?

I am analysing some data and need help.
Basically, I have a dataset that looks like this:
date <- seq(as.Date("2017-04-01"),as.Date("2017-05-09"),length.out=40)
switch <- c(rep(1:2,each=10),rep(1:2,each=10))
O2 <- runif(40,min=21.02,max=21.06)
CO2 <- runif(40,min=0.076,max=0.080)
test.data <- data.frame(date,switch,O2,CO2)
As can be seen, there's a switch column that switches between 1 and 2 every 10 data points. I want to write code that does the following: whenever the "switch" column changes its value (from 1 to 2, or 2 to 1), delete the first 5 rows of data after the switch (i.e. keep only the last 5 data points for all 4 variables), average the remaining data points for O2 and CO2, and put them in 2 new columns (avg.O2 and avg.CO2) before the next switch. Then repeat this process until the end.
It's quite easy to do manually on paper or in Excel, but my real dataset comprises thousands of data points and I would like R to do it automatically for me. Does anyone have any ideas that could help me?
Please find my edits below, which should work for both regular and irregular switching.
date <- seq(as.Date("2017-04-01"),as.Date("2017-05-09"),length.out=40)
switch <- c(rep(1:2,each=10),rep(1:2,each=10))
O2 <- runif(40,min=21.02,max=21.06)
CO2 <- runif(40,min=0.076,max=0.080)
test.data <- data.frame(date,switch,O2,CO2)
CleanMachineData <- function(Data, SwitchData, UnreliableRows = 5) {
  # First, we can properly turn your switch column into a grouping column: (1,2,1,2) -> (1,2,3,4)
  grouplength <- rle(Data[, "switch"])$lengths
  # mapply lets us feed vector arguments into functions that usually take only one/first-element arguments.
  # In this case we create a sequence for each run length (the output is a list or a matrix).
  grouping <- mapply(seq, grouplength)
  # Here we want it to become a single vector representing the groups
  groups <- mapply(rep, 1:length(grouplength), each = grouplength)
  # If the switching frequency was irregular it will be a list, if regular a matrix;
  # convert either into a vector as follows:
  if (is.list(grouping)) {
    groups <- unlist(groups)
  } else {
    groups <- as.vector(groups)
  }
  Data$group <- groups

  # vector of the first row of each new switch (except the starting 0)
  switchRow <- c(0, which(abs(diff(SwitchData)) == 1)) + 1
  # "as.vector" turns the matrix output of mapply into a plain sequence of numbers.
  # "ToRemove" holds all the row numbers to remove from your original data.
  ToRemove <- c(1:UnreliableRows, as.vector(mapply(seq, switchRow, switchRow + UnreliableRows - 1)))
  # "Keep" holds the row numbers that remain (not used further, kept for reference)
  Keep <- seq(nrow(Data))[-c(1:UnreliableRows, ToRemove)]
  # Create the new data (in case you don't know: data[<ROW>, <COLUMN>])
  newdat <- Data[-ToRemove, ]
  # return the result
  newdat
}
dat <- CleanMachineData(test.data, test.data$switch, 5)
dat
date switch O2 CO2 group
6 2017-04-05 1 21.03922 0.07648886 1
7 2017-04-06 1 21.04071 0.07747368 1
8 2017-04-07 1 21.05742 0.07946615 1
9 2017-04-08 1 21.04673 0.07782362 1
10 2017-04-09 1 21.04966 0.07936446 1
16 2017-04-15 2 21.02526 0.07833825 2
17 2017-04-16 2 21.04511 0.07747774 2
18 2017-04-17 2 21.03165 0.07662803 2
19 2017-04-18 2 21.03252 0.07960098 2
20 2017-04-19 2 21.04032 0.07892145 2
26 2017-04-25 1 21.03691 0.07691438 3
27 2017-04-26 1 21.05846 0.07857017 3
28 2017-04-27 1 21.04128 0.07891908 3
29 2017-04-28 1 21.03837 0.07817021 3
30 2017-04-29 1 21.02334 0.07917546 3
36 2017-05-05 2 21.02890 0.07723042 4
37 2017-05-06 2 21.04606 0.07979641 4
38 2017-05-07 2 21.03822 0.07985775 4
39 2017-05-08 2 21.04136 0.07781525 4
40 2017-05-09 2 21.05375 0.07941123 4
aggregate(cbind(O2,CO2) ~ group, dat, mean)
group O2 CO2
1 1 21.04675 0.07812336
2 2 21.03497 0.07819329
3 3 21.03967 0.07834986
4 4 21.04166 0.07882221
# crazier, irregular switching
test.data2 <- test.data
test.data2$switch <- unlist(mapply(rep, 1:2, times = 1, each = c(10,8,10,5,3,10)))[1:20]
dat2 <- CleanMachineData(test.data2, test.data2$switch, 5)
dat2
date switch O2 CO2 group
6 2017-04-05 1 21.03922 0.07648886 1
7 2017-04-06 1 21.04071 0.07747368 1
8 2017-04-07 1 21.05742 0.07946615 1
9 2017-04-08 1 21.04673 0.07782362 1
10 2017-04-09 1 21.04966 0.07936446 1
16 2017-04-15 2 21.02526 0.07833825 2
17 2017-04-16 2 21.04511 0.07747774 2
18 2017-04-17 2 21.03165 0.07662803 2
24 2017-04-23 1 21.05658 0.07669662 3
25 2017-04-24 1 21.04452 0.07983165 3
26 2017-04-25 1 21.03691 0.07691438 3
27 2017-04-26 1 21.05846 0.07857017 3
28 2017-04-27 1 21.04128 0.07891908 3
29 2017-04-28 1 21.03837 0.07817021 3
30 2017-04-29 1 21.02334 0.07917546 3
36 2017-05-05 2 21.02890 0.07723042 4
37 2017-05-06 2 21.04606 0.07979641 4
38 2017-05-07 2 21.03822 0.07985775 4
# You can try a range of values for UnreliableRows with the following
lapply(5:7, function(x) {
  dat <- CleanMachineData(test.data2, test.data2$switch, x)
  list(data = dat, means = aggregate(cbind(O2, CO2) ~ group, dat, mean))
})
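As an aside, a more compact hedged sketch of the same idea is possible with data.table::rleid(), which labels each run of identical switch values directly. This is not the answer's method, just an alternative, assuming test.data as generated above:

library(data.table)
dt <- as.data.table(test.data)
dt[, group := rleid(switch)]         # one group per run of 1s or 2s
dt <- dt[, .SD[-(1:5)], by = group]  # drop the first 5 rows of every run (runs of 5 or fewer end up empty)
dt[, .(avg.O2 = mean(O2), avg.CO2 = mean(CO2)), by = group]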
Use
test.data[rep(c(FALSE, TRUE), each=5),]
to always select the last five rows of each group of 10 rows.
Then you can use aggregate:
d2 <- test.data[rep(c(FALSE, TRUE), each=5),]
aggregate(cbind(O2, CO2) ~ 1, data=d2, FUN=mean)
If you want the average for every 5-rows-group:
aggregate(cbind(O2, CO2) ~ gl(k=5, n=nrow(d2)/5L), data=d2, FUN=mean)
Here is a generalization for an arbitrary number of rows in test.data:
stay <- rep(c(FALSE, TRUE), each=5, length.out=nrow(test.data))
d2 <- test.data[stay,]
group <- gl(k=5, n=nrow(d2)/5L+1L, length=nrow(d2))
aggregate(cbind(O2, CO2) ~ group, data=d2, FUN=mean)
Here is a variant for mixing the data with the averages:
group <- gl(k=10, n=nrow(test.data)/10L+1L, length=nrow(test.data))
L <- split(test.data, group)
mySummary <- function(x) {
  if (nrow(x) <= 5) return(NULL)
  x <- x[-(1:5), ]
  d.avg <- aggregate(cbind(O2, CO2) ~ 1, data = x, FUN = mean)
  rbind(x, cbind(date = NA, switch = -1, d.avg))
}
lapply(L, mySummary) # as list of dataframes
do.call(rbind, lapply(L, mySummary)) # as one dataframe
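Since mySummary() flags its appended average row with switch = -1 and date = NA (see above), the per-block averages can be pulled back out of the combined data frame if needed:

out <- do.call(rbind, lapply(L, mySummary))
subset(out, switch == -1)   # only the average rows appended per 10-row block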

Search for value within a range of values in two separate vectors

This is my first time posting to Stack Exchange, my apologies as I'm certain I will make a few mistakes. I am trying to assess false detections in a dataset.
I have one data frame with "true" detections
truth=
ID Start Stop SNR
1 213466 213468 10.08
2 32238 32240 10.28
3 218934 218936 12.02
4 222774 222776 11.4
5 68137 68139 10.99
And another data frame with a list of times, that represent possible 'real' detections
possible=
ID Times
1 32239.76
2 32241.14
3 68138.72
4 111233.93
5 128395.28
6 146180.31
7 188433.35
8 198714.7
I am trying to see if the values in my 'possible' data frame lie between the start and stop values. If so, I'd like to create a third column in possible called "between" and a column in the "truth" data frame called "match". For every value from possible that falls between, I'd like a 1, otherwise a 0. For every row in "truth" that finds a match, I'd like a 1, otherwise a 0.
Neither ID nor SNR is important. I'm not looking to match on ID; instead I want to run through the data frame entirely. Output should look something like:
ID Times Between
1 32239.76 0
2 32241.14 1
3 68138.72 0
4 111233.93 0
5 128395.28 0
6 146180.31 1
7 188433.35 0
8 198714.7 0
Alternatively, knowing if any of my 'possible' time values fall within 2 seconds of start or end times would also do the trick (also with 1/0 outputs)
(Thanks for the feedback on the original post)
Thanks in advance for your patience with me as I navigate this system.
I think this can be conceptualised as a rolling join in data.table. Take this simplified example:
truth
# id start stop
#1: 1 1 5
#2: 2 7 10
#3: 3 12 15
#4: 4 17 20
#5: 5 22 26
possible
# id times
#1: 1 3
#2: 2 11
#3: 3 13
#4: 4 28
setDT(truth)
setDT(possible)
melt(truth, measure.vars = c("start", "stop"), value.name = "times")[
  possible, on = "times", roll = TRUE
][, .(id = i.id, truthid = id, times, status = factor(variable, labels = c("in", "out")))]
# id truthid times status
#1: 1 1 3 in
#2: 2 2 11 out
#3: 3 3 13 in
#4: 4 5 28 out
The source datasets were:
truth <- read.table(text="id start stop
1 1 5
2 7 10
3 12 15
4 17 20
5 22 26", header=TRUE)
possible <- read.table(text="id times
1 3
2 11
3 13
4 28", header=TRUE)
I'll post a solution that I'm pretty sure works like you want it to in order to get you started. Maybe someone else can post a more efficient answer.
Anyway, first I needed to generate some example data - next time please provide this from your own data set in your post using the function dput(head(truth, n = 25)) and dput(head(possible, n = 25)). I used:
# generate random test data
set.seed(7)
truth <- data.frame(c(1:100),
                    c(sample(5:20, size = 100, replace = T)),
                    c(sample(21:50, size = 100, replace = T)))

possible <- data.frame(c(sample(1:15, size = 15, replace = F)))
colnames(possible) <- "Times"
After getting sample data to work with, the following solution provides what I believe you are asking for. It should scale directly to your own dataset as it seems to be laid out. Respond below if the comments are unclear.
# need the %between% operator
library(data.table)

# initialize vectors - 0 or FALSE by default
truth.match <- c(rep(0, times = nrow(truth)))
possible.between <- c(rep(0, times = nrow(possible)))

# iterate through the 'possible' dataframe
for (i in 1:nrow(possible)) {
  # get a boolean vector showing whether any of the 'truth' rows are a 'match'
  match.vec <- apply(truth[, 2:3],
                     MARGIN = 1,
                     FUN = function(x) { possible$Times[i] %between% x })
  # if any are TRUE then update the match and between vectors
  if (any(match.vec)) {
    truth.match[match.vec] <- 1
    possible.between[i] <- 1
  }
}
#i think this should be called anyMatch for clarity
truth$anyMatch <- truth.match
#similarly; betweenAny
possible$betweenAny <- possible.between
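For reference, the 0/1 columns asked for in the question can also be produced without an explicit loop via a non-equi join on the original column names (Start/Stop/Times). This is only a hedged sketch; it uses mult = "first", so it assumes that flagging the first containing interval is enough if a time ever falls inside more than one:

library(data.table)
setDT(truth); setDT(possible)
# for each Times value, index of the first truth row whose [Start, Stop] contains it (NA if none)
idx <- truth[possible, on = .(Start <= Times, Stop >= Times), which = TRUE, mult = "first"]
possible[, Between := as.integer(!is.na(idx))]
truth[, match := as.integer(seq_len(.N) %in% idx)]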

row minus row within different list R

How can I calculate the difference between rows within different lists, when the lists have different dimensions?
I use the code as follows
names(ri1)
[1] "Sedol" "code" "ri" "date"
ri1<-ri1[order(ri1$Sedol,ri1$date),]
sri<-split(ri1,ri1$Sedol)
ri1$r<-as.vector(sapply(seq_along(sri), function(x) diff(c(0, sri[[x]][,3]))))
However, it throws the error:
"Error in `$<-.data.frame`(`*tmp*`, "r", value = list(c(100, 0.00790000000000646, :
replacement has 1485 rows, data has 4687655"
For example, I have three lists:
date ri
1990 1
1991 2
1992 3
date ri
1990 1
1991 2
1992 3
1993 4
date ri
1990 1
1991 2
I want the results to look like:
date ri r
1990 1 0%
1991 2 100%
1992 3 100%
date ri r
1990 1 0%
1991 2 100%
1992 3 100%
1993 4 100%
date ri r
1990 1 0%
1991 2 100%
Notice: r = ri(t+1) / ri(t) - 1
Using diff and lapply you can get something like
# I generate some data
dat1 <- data.frame(date = seq(1990, 1999, length.out = 5), ri = seq(1, 10, length.out = 5))
dat2 <- data.frame(date = seq(1990, 1999, length.out = 5), ri = seq(1, 5, length.out = 5))

# I put the data.frames in a list
ll <- list(dat1, dat2)

# I use lapply:
ll <- lapply(ll, function(dat) {
  # I apply the formula you give, in a vectorised version
  # (maybe you need only the diff in percent?)
  dat$r <- round(c(0, diff(dat$ri)) / dat$ri * 100)
  dat
})
ll
ll
[[1]]
date ri r
1 1990.00 1.00 0
2 1992.25 3.25 69
3 1994.50 5.50 41
4 1996.75 7.75 29
5 1999.00 10.00 22
[[2]]
date ri r
1 1990.00 1 0
2 1992.25 2 50
3 1994.50 3 33
4 1996.75 4 25
5 1999.00 5 20
You should use a combination of head and tail as follows:
r.fun <- function(ri) c(0, tail(ri, -1) / head(ri, -1) - 1)
lapply(sri, transform, r = r.fun(ri))
If your goal is to recombine (rbind) your data afterwards, then know that you can split/apply/combine everything within a single call to ave from the base package, or ddply from the plyr package:
transform(ri1, r = ave(ri, Sedol, FUN = r.fun))
or
library(plyr)
ddply(ri1, "Sedol", transform, r = r.fun(ri))
Edit: If you want the output to be in XX% as in your example, replace r.fun with:
r.fun <- function(ri) paste0(round(100 * c(0, tail(ri, -1) / head(ri, -1) - 1)), "%")
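As a quick check, applying r.fun to the short ri sequence 1, 2, 3, 4 from the question gives the following (the formula r(t+1)/r(t) - 1 yields 50% and 33% for the later steps, rather than the illustrative 100% values shown in the question):

r.fun(c(1, 2, 3, 4))
# [1] "0%"   "100%" "50%"  "33%"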
