The data set is this:
badData <- list(c(296,310), c(330,335), c(350,565))
df <- data.frame(wavelength = seq(300,360,5.008667),
reflectance = seq(-1,-61,-5.008667))
df
   wavelength reflectance
1    300.0000   -1.000000
2    305.0087   -6.008667
3    310.0173  -11.017334
4    315.0260  -16.026001
5    320.0347  -21.034668
6    325.0433  -26.043335
7    330.0520  -31.052002
8    335.0607  -36.060669
9    340.0693  -41.069336
10   345.0780  -46.078003
11   350.0867  -51.086670
12   355.0953  -56.095337
The original question was how to identify whether wavelength fell in any of the ranges given in badData.
The solution offered is this:
https://stackoverflow.com/a/52070363/1012249
My question is: using similar syntax, how does one identify which badData bin a wavelength falls into? Let's say badData were structured like this, with non-overlapping bins:
badData <- data.frame(bin=c('a','b','c'),start= c(296,330,350),end=c(310.01,335,565))
Here is an example using fuzzy join:
library(fuzzyjoin)
library(dplyr)  # for the %>% pipe

df %>%
  fuzzy_left_join(badData,                        # join badData to df
                  by = c("wavelength" = "start",  # variables to join by
                         "wavelength" = "end"),
                  match_fun = list(`>=`, `<=`))   # one function per pair of variables,
                                                  # i.e. wavelength >= start and wavelength <= end
#output
wavelength reflectance bin start end
1 300.0000 -1.000000 a 296 310.01
2 305.0087 -6.008667 a 296 310.01
3 310.0173 -11.017334 <NA> NA NA
4 315.0260 -16.026001 <NA> NA NA
5 320.0347 -21.034668 <NA> NA NA
6 325.0433 -26.043335 <NA> NA NA
7 330.0520 -31.052002 b 330 335.00
8 335.0607 -36.060669 <NA> NA NA
9 340.0693 -41.069336 <NA> NA NA
10 345.0780 -46.078003 <NA> NA NA
11 350.0867 -51.086670 c 350 565.00
12 355.0953 -56.095337 c 350 565.00
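If you only want the bin label and not the interval bounds, the helper columns can be dropped after the join (a small follow-up sketch; select() is from dplyr):
df %>%
  fuzzy_left_join(badData,
                  by = c("wavelength" = "start", "wavelength" = "end"),
                  match_fun = list(`>=`, `<=`)) %>%
  select(-start, -end)  # keeps wavelength, reflectance, bin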
You don't need a loop. You can simply use cut:
badData <- data.frame(bin=c('a','b','c'),start= c(296,330,350),end=c(310.01,335,565))
df <- data.frame(wavelength = seq(300,360,5.008667),
reflectance = seq(-1,-61,-5.008667))
# Breaks: interleave each bin's start/end (this works because the bins are
# non-overlapping and sorted). Labels alternate bin name / "good"; the final
# "good" is dropped so the label count matches the number of intervals.
df$bins <- cut(df$wavelength, t(badData[, c("start", "end")]),
labels = head(c(t(cbind(as.character(badData$bin), "good"))), -1))
# wavelength reflectance bins
#1 300.0000 -1.000000 a
#2 305.0087 -6.008667 a
#3 310.0173 -11.017334 good
#4 315.0260 -16.026001 good
#5 320.0347 -21.034668 good
#6 325.0433 -26.043335 good
#7 330.0520 -31.052002 b
#8 335.0607 -36.060669 good
#9 340.0693 -41.069336 good
#10 345.0780 -46.078003 good
#11 350.0867 -51.086670 c
#12 355.0953 -56.095337 c
You haven't said which side of the intervals should be open or closed, but this can be adjusted.
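For instance, cut() defaults to intervals closed on the right, (start, end]; passing right = FALSE flips this to [start, end) if that matches your bin definitions better (same breaks and labels as above):
df$bins <- cut(df$wavelength, t(badData[, c("start", "end")]),
               labels = head(c(t(cbind(as.character(badData$bin), "good"))), -1),
               right = FALSE)  # intervals become [start, end)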
Given an uncertain number of columns containing source values for the same variable, I would like to create a column that defines the final value to be selected, depending on source importance and availability.
Reproducible data:
set.seed(123)
actuals = runif(10, 500, 1000)
get_rand_vector <- function(){return (runif(10, 0.95, 1.05))}
get_na_rand_ixs <- function(){return (round(runif(5,0,10),0))}
df = data.frame("source_1" = actuals*get_rand_vector(),
"source_2" = actuals*get_rand_vector(),
"source_n" = actuals*get_rand_vector())
df[["source_1"]][get_na_rand_ixs()] <- NA
df[["source_2"]][get_na_rand_ixs()] <- NA
df[["source_n"]][get_na_rand_ixs()] <- NA
My manual solution is as follows:
df$available <- ifelse(
!is.na(df$source_1),
df$source_1,
ifelse(
!is.na(df$source_2),
df$source_2,
df$source_n
)
)
Given the desired result of:
source_1 source_2 source_n available
1 NA NA NA NA
2 NA NA 930.1242 930.1242
3 716.9981 NA 717.9234 716.9981
4 NA 988.0446 NA 988.0446
5 931.7081 NA 924.1101 931.7081
6 543.6802 533.6798 NA 543.6802
7 744.6525 767.4196 783.8004 744.6525
8 902.8788 955.1173 NA 902.8788
9 762.3690 NA 761.6135 762.3690
10 761.4092 702.6064 708.7615 761.4092
How could I automatically iterate over the available sources to set the data to be considered? In some cases the number of sources could be 1, 2, 3, ..., 7, and priority follows the natural order (1 > 2 > ...).
Once you have all of the candidate vectors in order and in an appropriate data structure (e.g., data.frame or matrix), you can use apply to apply a function over the rows. In this case, we just look for the first non-NA value. Thus, after the first block of code above, you only need the following line:
df$available <- apply(df, 1, FUN = function(x) x[which(!is.na(x))[1]])
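One caveat: apply() coerces the data frame to a matrix, which is fine here because every column is numeric. If df also contained non-source columns (e.g. an existing available column from the manual approach), you would subset first; a hedged sketch (src_cols is an illustrative helper):
# restrict apply() to the source columns only
src_cols <- grep("^source_", names(df), value = TRUE)
df$available <- apply(df[src_cols], 1, function(x) x[which(!is.na(x))[1]])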
coalesce() from dplyr is designed for this:
library(dplyr)
df %>%
mutate(available = coalesce(!!!.))
source_1 source_2 source_n available
1 NA NA NA NA
2 NA NA 930.1242 930.1242
3 716.9981 NA 717.9234 716.9981
4 NA 988.0446 NA 988.0446
5 931.7081 NA 924.1101 931.7081
6 543.6802 533.6798 NA 543.6802
7 744.6525 767.4196 783.8004 744.6525
8 902.8788 955.1173 NA 902.8788
9 762.3690 NA 761.6135 762.3690
10 761.4092 702.6064 708.7615 761.4092
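The !!!. splices every column of the piped data frame into coalesce(). With a fixed, known set of columns, the explicit call is equivalent and arguably easier to read (assuming the column names above):
df %>%
  mutate(available = coalesce(source_1, source_2, source_n))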
I am new to R and struggling to understand its quirks. I'm trying to do something which should be really simple, but is turning out to be apparently very complicated.
I am used to Excel, SQL and Minitab, where you can enter a value in one column which includes references to other columns and parameters. However, R doesn't seem to allow me to do this.
I have a table with (currently) four columns:
Date Pallets Lt Tt
1 28/12/2011 491 NA NA
2 29/12/2011 385 NA 0.787890411
3 30/12/2011 662 NA NA
4 31/12/2011 28 NA NA
5 01/01/2012 46 NA NA
6 02/01/2012 403 NA NA
7 03/01/2012 282 NA NA
8 04/01/2012 315 NA NA
9 05/01/2012 327 NA NA
10 06/01/2012 458 NA NA
and have a parameter beta, to which I have assigned the value 0.0002.
All I want to do is assign a formula to rows 3:10, which is:
Tt[t] = beta * (Pallets[t] - Pallets[t-1]) + (1 - beta) * Tt[t-1]
I thought that the appropriate code might be:
Table[3:10,4]<-beta*(Table[3:10,"Pallets"]-Table[2:9,"Pallets"])+(1-beta)*Table[2:9,"Tt"]
However, this doesn't work. The first time I enter this formula, it generates:
Date Pallets Lt Tt
1 28/12/2011 491 NA NA
2 29/12/2011 385 NA 0.7878904
3 30/12/2011 662 NA 0.8431328
4 31/12/2011 28 NA NA
5 01/01/2012 46 NA NA
6 02/01/2012 403 NA NA
7 03/01/2012 282 NA NA
8 04/01/2012 315 NA NA
9 05/01/2012 327 NA NA
10 06/01/2012 458 NA NA
So it's generated the correct answer for the second item in the series, but not for any of the subsequent values.
It seems as though R doesn't automatically update each row, and the relationship to each other row, when you enter a formula, as Excel does. Having said that, Excel actually would require me to enter the formula in cell [4,Tt], and then drag this down to all of the other cells. Perhaps R is the same, and there is an equivalent to "dragging down" which I need to do?
Finally, I also noticed that when I change the value of the beta parameter (e.g. beta <- 0.5) and then print the Table values again, they are unchanged - so the table hasn't updated even though I have changed the value of the parameter.
Appreciate that these are basic questions, but I am very new to R.
In R, the computations are not made "cell by cell", but are vectorised - in your example, R takes the vectors Table[3:10,"Pallets"], Table[2:9,"Pallets"] and Table[2:9,"Tt"] as they are at the moment, computes the resulting vector, and finally assigns it to Table[3:10,4].
If you want to make some computations "cell by cell", you have to use a for loop:
beta <- 0.5
df <- data.frame(v1 = 1:12, v2 = 0)
for (i in 3:10) {
df[i, "v2"] <- beta * (df[i, "v1"] - df[i-1, "v1"]) + (1 - beta) * df[i-1, "v2"]
}
df
v1 v2
1 1 0.0000000
2 2 0.0000000
3 3 0.5000000
4 4 0.7500000
5 5 0.8750000
6 6 0.9375000
7 7 0.9687500
8 8 0.9843750
9 9 0.9921875
10 10 0.9960938
11 11 0.0000000
12 12 0.0000000
As for your second question, R will never update any values on its own (imagine having set manual calculation in Excel), so you need to repeat the computations after changing beta.
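One idiom that helps here is wrapping the computation in a function, so recomputing after a parameter change is a single call (a hedged sketch reusing the toy df above):
# recompute v2 for any beta with one call; nothing updates automatically
smooth_v2 <- function(df, beta) {
  for (i in 3:10) {
    df[i, "v2"] <- beta * (df[i, "v1"] - df[i-1, "v1"]) + (1 - beta) * df[i-1, "v2"]
  }
  df
}
df <- smooth_v2(df, 0.5)  # rerun whenever beta changes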
Although it's generally bad design, you can iterate over rows in a loop:
Table$temp <- c(0, diff(Table$Pallets, 1))
prevTt <- 0
for (i in 1:10) {
  Table$Tt[i] <- Table$temp[i] * beta + (1 - beta) * prevTt
  prevTt <- Table$Tt[i]
}
Table$temp <- NULL
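As an aside, this particular recursion is a linear recursive filter, so it can also be vectorised with stats::filter (a hedged equivalent of the loop above, assuming Tt starts from 0):
# y[i] = x[i] + (1 - beta) * y[i-1], with x = beta * differenced Pallets
Table$Tt <- as.numeric(stats::filter(beta * c(0, diff(Table$Pallets)),
                                     filter = 1 - beta, method = "recursive"))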
I am analysing some data and need help.
Basically, I have a dataset that looks like this:
date <- seq(as.Date("2017-04-01"),as.Date("2017-05-09"),length.out=40)
switch <- c(rep(1:2,each=10),rep(1:2,each=10))
O2 <- runif(40,min=21.02,max=21.06)
CO2 <- runif(40,min=0.076,max=0.080)
test.data <- data.frame(date,switch,O2,CO2)
As can be seen, there's a switch column that switches between 1 and 2 every 10 data points. I want to write code that does the following: when the switch column changes value (from 1 to 2, or 2 to 1), delete the first 5 rows of data after the switch (i.e., keep only the last 5 data points for all 4 variables), average the remaining data points for O2 and CO2, and put them in 2 new columns (avg.O2 and avg.CO2) before the next switch. Then repeat this process until the end.
It's quite easy to do manually on paper or in Excel, but my real dataset comprises thousands of data points and I would like R to do it automatically. Does anyone have ideas that could help?
Please find my edits, which should work for both regular and irregular switching:
CleanMachineData <- function(Data, SwitchData, UnreliableRows = 5){
# First, we can properly turn your switch column into a grouping column (1,2,1,2)->(1,2,3,4)
grouplength <- rle(Data[,"switch"])$lengths
# mapply lets us feed vector arguments into functions that normally take
# one value at a time; here rep() builds one vector of group ids per run
groups <- mapply(rep, 1:length(grouplength), each = grouplength)
# if the switching was irregular the result is a list, if regular a matrix;
# either way, flatten it into a single vector of group ids
if (is.list(groups)) {
groups <- unlist(groups)
} else {
groups <- as.vector(groups)
}
Data$group <- groups
#
# vector of the first row of each switch run (including row 1)
switchRow <- c(0, which(abs(diff(SwitchData)) == 1)) + 1
# "ToRemove" collects the first UnreliableRows row numbers of every run;
# as.vector flattens the matrix output of mapply into one plain vector
# (switchRow already includes row 1, so the start of the data is covered)
ToRemove <- as.vector(mapply(seq, switchRow, switchRow + UnreliableRows - 1))
# Create the new data, (in case you don't know: data[<ROW>,<COLUMN>])
newdat <- Data[-ToRemove,]
# print the results
newdat
}
dat <- CleanMachineData(test.data, test.data$switch, 5)
dat
date switch O2 CO2 group
6 2017-04-05 1 21.03922 0.07648886 1
7 2017-04-06 1 21.04071 0.07747368 1
8 2017-04-07 1 21.05742 0.07946615 1
9 2017-04-08 1 21.04673 0.07782362 1
10 2017-04-09 1 21.04966 0.07936446 1
16 2017-04-15 2 21.02526 0.07833825 2
17 2017-04-16 2 21.04511 0.07747774 2
18 2017-04-17 2 21.03165 0.07662803 2
19 2017-04-18 2 21.03252 0.07960098 2
20 2017-04-19 2 21.04032 0.07892145 2
26 2017-04-25 1 21.03691 0.07691438 3
27 2017-04-26 1 21.05846 0.07857017 3
28 2017-04-27 1 21.04128 0.07891908 3
29 2017-04-28 1 21.03837 0.07817021 3
30 2017-04-29 1 21.02334 0.07917546 3
36 2017-05-05 2 21.02890 0.07723042 4
37 2017-05-06 2 21.04606 0.07979641 4
38 2017-05-07 2 21.03822 0.07985775 4
39 2017-05-08 2 21.04136 0.07781525 4
40 2017-05-09 2 21.05375 0.07941123 4
aggregate(cbind(O2,CO2) ~ group, dat, mean)
group O2 CO2
1 1 21.04675 0.07812336
2 2 21.03497 0.07819329
3 3 21.03967 0.07834986
4 4 21.04166 0.07882221
# crazier, irregular switching
test.data2 <- test.data
test.data2$switch <- unlist(mapply(rep, 1:2, times = 1, each = c(10,8,10,5,3,10)))[1:20]
dat2 <- CleanMachineData(test.data2, test.data2$switch, 5)
dat2
date switch O2 CO2 group
6 2017-04-05 1 21.03922 0.07648886 1
7 2017-04-06 1 21.04071 0.07747368 1
8 2017-04-07 1 21.05742 0.07946615 1
9 2017-04-08 1 21.04673 0.07782362 1
10 2017-04-09 1 21.04966 0.07936446 1
16 2017-04-15 2 21.02526 0.07833825 2
17 2017-04-16 2 21.04511 0.07747774 2
18 2017-04-17 2 21.03165 0.07662803 2
24 2017-04-23 1 21.05658 0.07669662 3
25 2017-04-24 1 21.04452 0.07983165 3
26 2017-04-25 1 21.03691 0.07691438 3
27 2017-04-26 1 21.05846 0.07857017 3
28 2017-04-27 1 21.04128 0.07891908 3
29 2017-04-28 1 21.03837 0.07817021 3
30 2017-04-29 1 21.02334 0.07917546 3
36 2017-05-05 2 21.02890 0.07723042 4
37 2017-05-06 2 21.04606 0.07979641 4
38 2017-05-07 2 21.03822 0.07985775 4
# You can try a range of UnreliableRows values with the following
lapply(5:7, function(x) {
dat <- CleanMachineData(test.data2, test.data2$switch, x)
list(data = dat, means = aggregate(cbind(O2,CO2)~group, dat, mean))
})
Use
test.data[rep(c(FALSE, TRUE), each=5),]
to always select the last five rows from each group of 10 rows.
Then you can use aggregate:
d2 <- test.data[rep(c(FALSE, TRUE), each=5),]
aggregate(cbind(O2, CO2) ~ 1, data=d2, FUN=mean)
If you want the average for every 5-rows-group:
aggregate(cbind(O2, CO2) ~ gl(k=5, n=nrow(d2)/5L), data=d2, FUN=mean)
Here is a generalization for the situation of arbitrary number of rows in test.data:
stay <- rep(c(FALSE, TRUE), each=5, length.out=nrow(test.data))
d2 <- test.data[stay,]
group <- gl(k=5, n=nrow(d2)/5L+1L, length=nrow(d2))
aggregate(cbind(O2, CO2) ~ group, data=d2, FUN=mean)
Here is a variant for mixing the data with the averages:
group <- gl(k=10, n=nrow(test.data)/10L+1L, length=nrow(test.data))
L <- split(test.data, group)
mySummary <- function(x) {
if (nrow(x) <= 5) return(NULL)
x <- x[-(1:5),]
d.avg <- aggregate(cbind(O2, CO2) ~ 1, data=x, FUN=mean)
rbind(x, cbind(date=NA, switch=-1, d.avg))
}
lapply(L, mySummary) # as list of dataframes
do.call(rbind, lapply(L, mySummary)) # as one dataframe
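For completeness, a hedged dplyr sketch of the same keep-the-last-5-and-average logic, writing the averages into avg.O2/avg.CO2 columns as the question asked (run ids are derived from changes in switch; slice_tail() requires dplyr >= 1.0):
library(dplyr)
test.data %>%
  mutate(run = cumsum(switch != lag(switch, default = first(switch)))) %>%
  group_by(run) %>%
  slice_tail(n = 5) %>%              # drop the first 5 rows of each run
  mutate(avg.O2 = mean(O2),          # per-run averages of the kept rows
         avg.CO2 = mean(CO2)) %>%
  ungroup()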
I have the following data frame (this is only the head of the data frame). The ID column is subject (I have more subjects in the data frame, not only subject #99). I want to calculate the mean "rt" by "subject" and "condition" only for observations that have z.score (in absolute values) smaller than 1.
> b
subject rt ac condition z.score
1 99 1253 1 200_9 1.20862682
2 99 1895 1 102_2 2.95813507
3 99 1049 1 68_1 1.16862102
4 99 1732 1 68_9 2.94415384
5 99 765 1 34_9 -0.63991180
7 99 1016 1 68_2 -0.03191493
I know I can do it using tapply or dcast (from reshape2) after subsetting the data:
b1 <- subset(b, abs(z.score) < 1)
b2 <- dcast(b1, subject~condition, mean, value.var = "rt")
subject 34_1 34_2 34_9 68_1 68_2 68_9 102_1 102_2 102_9 200_1 200_2 200_9
1 99 1028.5714 957.5385 861.6818 837.0000 969.7222 856.4000 912.5556 977.7273 858.7800 1006.0000 1015.3684 913.2449
2 5203 957.8889 815.2500 845.7750 933.0000 893.0000 883.0435 926.0000 879.2778 813.7308 804.2857 803.8125 843.7200
3 5205 1456.3333 1008.4286 850.7170 1142.4444 910.4706 998.4667 935.2500 980.9167 897.4681 1040.8000 838.7917 819.9710
4 5306 1022.2000 940.5882 904.6562 1525.0000 1216.0000 929.5167 955.8571 981.7500 902.8913 997.6000 924.6818 883.4583
5 5307 1396.1250 1217.1111 1044.4038 1055.5000 1115.6000 980.5833 1003.5714 1482.8571 941.4490 1091.5556 1125.2143 989.4918
6 5308 659.8571 904.2857 966.7755 960.9091 1048.6000 904.5082 836.2000 1753.6667 926.0400 870.2222 1066.6667 930.7500
In the example above for b1 each of the subjects had observations that met the subset demands.
However, it may be that a certain subject has no observations left after subsetting. In that case I want b2 to contain NA for that subject in each condition where no observations meet the subset demands. Does anyone have an idea for a way to do that?
There is a drop argument in dcast that you can use in this situation, but you'll need to convert subject to a factor.
Here is a dataset with a second subject ID that has no values that meet your condition that the absolute value of z.score is less than one.
library(reshape2)
bb = data.frame(subject=c(99,99,99,99,99,11,11,11), rt=c(100,150,2,4,10,15,1,2),
ac=rep(1,8), condition=c("A","A","B","D","C","C","D","D"),
z.score=c(0.2,0.3,0.2,0.3,.2,2,2,2))
If you reshape this to a wide format with dcast, you lose subject number 11 even with the drop argument.
dcast(subset(bb, abs(z.score) < 1), subject ~ condition, fun = mean,
value.var = "rt", drop = FALSE)
subject A B C D
1 99 125 2 10 4
Make subject a factor.
bb$subject = factor(bb$subject)
Now you can dcast with drop = FALSE to keep all subjects in the wide dataset.
dcast(subset(bb, abs(z.score) < 1), subject ~ condition, fun = mean,
value.var = "rt", drop = FALSE)
subject A B C D
1 11 NaN NaN NaN NaN
2 99 125 2 10 4
To get NA instead of NaN you can use the fill argument.
dcast(subset(bb, abs(z.score) < 1), subject ~ condition, fun = mean,
value.var = "rt", drop = FALSE, fill = as.numeric(NA))
subject A B C D
1 11 NA NA NA NA
2 99 125 2 10 4
Is the following what you are after? I created a similar dataset, bb:
library("plyr")  # needed for the .() function below
bb <- data.frame(subject=c(99,99,99,99,99,11,11,11),
                 rt=c(100,150,2,4,10,15,1,2),
                 ac=rep(1,8),
                 condition=c("A","A","B","D","C","C","D","D"),
                 z.score=c(0.2,0.3,0.2,0.3,1.5,-0.3,0.8,0.7))
bb
subject rt ac condition z.score
#1 99 100 1 A 0.2
#2 99 150 1 A 0.3
#3 99 2 1 B 0.2
#4 99 4 1 D 0.3
#5 99 10 1 C 1.5
#6 11 15 1 C -0.3
#7 11 1 1 D 0.8
#8 11 2 1 D 0.7
Then you call dcast with subset included:
cc <- dcast(bb, subject ~ condition, mean, value.var = "rt", subset = .(abs(z.score) < 1))
cc
subject A B C D
#1 11 NaN NaN 15 1.5
#2 99 125 2 NaN 4.0
I have a data frame that I am using for a small educational project.
EVTYPE PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
192527 URBAN/SMALL STREAM FLOODING 0.0 5 0
192938 HEAVY SNOW 1.7 5 0
193995 HAIL 30.0 5 25 M
194223 THUNDERSTORM WINDS 0.1 5 0
195672 THUNDERSTORM WINDS 0.0 5 0
198497 THUNDERSTORM WINDS 10.0 5 0
My objective is to create a new column named PropAmtDmg that takes the following form:
If PROPDMGEXP = "5" then 5 * PROPDMG
t1$PropAmtDmg <- ifelse(t1$PROPDMGEXP == "7", t1$PROPDMG * 7,
ifelse(t1$PROPDMGEXP == "5", t1$PROPDMG * 5,
0))
I might have more cases than just the two that I mentioned.
I would like to do this with sapply.
I would like to suggest data.table for this task. data.table is a package that enhances the data frames inherent in R, and it is very fast. One benefit is that there is no constant recopying of data, so if your data is large, this is memory efficient. Let's assume that your data frame is called dfr:
require(data.table)
set.seed(123) #set the seed so this can be replicated
dtb = data.table(PROPDMGEXP = sample(1:10, 10), PROPDMG = sample(1:10,10)) #sample data.table
dtb[(PROPDMGEXP %in% c(5,7)),rslt:=PROPDMG*PROPDMGEXP]
You are done. Here is the result:
PROPDMGEXP PROPDMG rslt
1: 3 10 NA
2: 8 5 NA
3: 4 6 NA
4: 7 9 63
5: 6 1 NA
6: 1 7 NA
7: 10 8 NA
8: 9 4 NA
9: 2 3 NA
10: 5 2 10
Note: if you want to make all the other entries 0 you can do this instead:
dtb[,rslt:=0][(PROPDMGEXP %in% c(5,7)),rslt:=PROPDMG*PROPDMGEXP]
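If more multiplier codes turn up, a small lookup table joined in with data.table's update-join syntax keeps this maintainable (a hedged sketch; exp_map is an illustrative name):
# map each PROPDMGEXP code to its multiplier; extend as new cases appear
exp_map <- data.table(PROPDMGEXP = c(5L, 7L), mult = c(5, 7))
dtb[exp_map, rslt := PROPDMG * mult, on = "PROPDMGEXP"]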
You can combine all the conditions into a single one like this (this assumes PROPDMGEXP is numeric; wrap it in as.numeric() if it is stored as character):
transform(t1,PropAmtDmg=ifelse(PROPDMGEXP %in% c(5,7),PROPDMG*PROPDMGEXP,0))
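Since the question mentions sapply, here is a hedged sketch of that approach too, using a named lookup vector (mult is an illustrative name; extend it with further cases as needed):
# multiplier per PROPDMGEXP code; unmatched codes fall back to 0
mult <- c("5" = 5, "7" = 7)
t1$PropAmtDmg <- sapply(seq_len(nrow(t1)), function(i) {
  m <- mult[as.character(t1$PROPDMGEXP[i])]
  if (is.na(m)) 0 else m * t1$PROPDMG[i]
})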