Apply formula to cells based on value in other column - r

I have a dataframe with 10 columns and 60 rows.
The first 2 columns contain treatment size (sq. m) and observation time (min), the other 8 show the 8 observed species as count data. All are integers.
I want to apply a correction to the observations:
Divide all observation counts by the square root of the treatment size
Divide all observation counts by the square root of the observation time
This can probably be done easily with apply() and an anonymous function, but I am not very familiar with them. Can anyone help?
Some example code (with just 2 species here):
dftot <- data.frame(matrix(NA, nrow = 60, ncol = 4))
colnames(dftot)<- c("size", "time", "guppy", "barracuda")
set.seed(123)
dftot$size <- sample(60, size = 60, replace = TRUE)
dftot$time <- sample(60, size = 60, replace = TRUE)
dftot$guppy <- sample(60, size = 60, replace = TRUE)
dftot$barracuda <- sample(60, size = 60, replace = TRUE)

Is this what you want?
library(dplyr)

dftot %>%
  mutate(across(guppy:last_col(), ~ .x / sqrt(size))) %>%
  mutate(across(guppy:last_col(), ~ .x / sqrt(time)))
Output:
size time guppy barracuda
1 31 12 2.7997696 1.76281788
2 15 13 4.1534663 3.43735140
3 51 18 0.4950738 0.42906394
4 14 33 0.9770084 0.88396000
5 3 57 2.8294607 4.28242703
6 42 27 0.2375655 1.54417606
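If you prefer base R, the same correction can be done in one step; a minimal sketch, assuming every column after size and time is a species count:
# Divide every species column by sqrt(size) and sqrt(time) in one go.
species_cols <- setdiff(names(dftot), c("size", "time"))
dftot[species_cols] <- dftot[species_cols] / sqrt(dftot$size * dftot$time)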

Create a vector in a LOOP with more conditions

I have a slightly complicated problem. I have two data frames, BD and BD2. For every row in BD I want to search through all of BD2 and obtain some summary statistics: Sum, Mean, Sd, etc.
For Sum I managed to make it work, but for Mean, Sd and Median I want to first collect the matching values in a vector and then apply those functions to it.
To be clearer, here is the code. For the sum it works, but I now want to save the values that meet the conditions in a vector and afterwards calculate the Mean, Sd and Median. I tried to build the vector outside of BD, but nothing works, or it is something tricky that I can't figure out.
for (i in 1:dim(BD)[1])
{
  for (j in 1:dim(BD2)[1])
  {
    if ((BD$Start.Date[i] < BD2$X_TIMESTAMP[j]) &&
        (BD2$X_TIMESTAMP[j] < BD$End.Date[i]) &&
        (BD$Linea[i] == BD2$Linea[j]))
    {
      vl = BD2$X_VALUE[j]
      BD$vec[i] = paste(BD$vec[i], vl, sep = "")
      #vect = as.numeric(BD$vec[i])
      BD$Sum[i] = BD$Sum[i] + BD2$X_VALUE[j]
      #BD$Average[i] = mean(vect)
    }
  }
}
I attached a photo of the data for BD, and another for BD2.
I tried to create vec before the for loop, but it still doesn't work.
The error is always the same: Error in $<-.data.frame(*tmp*, "vec", value = list(NULL, NULL, NULL, : replacement has 47 rows, data has 530
You may be able to use a combination of dplyr and purrr to get your summaries. You could split() your data.frame BD into a list of rows. Then you can use purrr::map() to work with each element of the list (i.e. each row of your data), find the matches in BD2, and calculate some statistics. After you have your stats, you can recombine everything back into a single data.frame with bind_rows():
library(dplyr)
library(purrr)

# create a small data example
BD <- data.frame(Start.Date = 1:5,
                 End.Date = 2:6,
                 Linea = c(1, 1, 3, 1, 1))
BD2 <- data.frame(X_TIMESTAMP = c(1.5, 1.5, 2.5, 3.5, 4.5),
                  X_VALUE = c(31, 4, 60, 20, -11),
                  Linea = c(1, 1, 1, 1, 1))

BD %>%
  split(seq(nrow(.))) %>%
  map(function(BD_row_i) {
    # find the timestamps that overlap in BD2
    BD2_matches <- BD2 %>%
      filter((BD_row_i$Start.Date < X_TIMESTAMP) &
             (BD_row_i$End.Date > X_TIMESTAMP) &
             (BD_row_i$Linea == Linea))
    if (nrow(BD2_matches) > 0) {
      # calculate some stats based on these matches
      BD2_matches <- BD2_matches %>%
        summarize(Sum = sum(X_VALUE),
                  Average = mean(X_VALUE),
                  Median = median(X_VALUE),
                  Sd = sd(X_VALUE))
    } else {
      # if no matches, return NAs (blank)
      BD2_matches <- BD2_matches %>%
        summarize(Sum = NA,
                  Average = NA,
                  Median = NA,
                  Sd = NA)
    }
    # add these statistics back to your BD_row_i data.frame
    BD_row_i <- cbind.data.frame(BD_row_i, BD2_matches)
    # return the updated data frame
    return(BD_row_i)
  }) %>%
  bind_rows()
Start.Date End.Date Linea Sum Average Median Sd
1 1 2 1 35 17.5 17.5 19.09188
2 2 3 1 60 60.0 60.0 NA
3 3 4 3 NA NA NA NA
4 4 5 1 -11 -11.0 -11.0 NA
5 5 6 1 NA NA NA NA
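If your dplyr is recent enough to have join_by() (1.1.0 or later), a non-equi join could avoid the explicit loop over rows entirely; this is only a sketch under that assumption:
library(dplyr)
# Join every BD row to the BD2 rows whose timestamp falls inside its window,
# then summarize per BD row (assumes Start.Date/End.Date/Linea identify a BD row).
# Unmatched rows keep NA statistics, as in the output above.
BD %>%
  left_join(BD2,
            by = join_by(Linea,
                         Start.Date < X_TIMESTAMP,
                         End.Date > X_TIMESTAMP)) %>%
  group_by(Start.Date, End.Date, Linea) %>%
  summarize(Sum = sum(X_VALUE),
            Average = mean(X_VALUE),
            Median = median(X_VALUE),
            Sd = sd(X_VALUE),
            .groups = "drop")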

Using barplot in R studio

When I try this code for a barplot (L$neighbourhood is the apartment neighbourhood in Paris, for example Champs-Elysées or Batignolles, which is string data, and L$price is the numeric data for the apartment price):
barplot(L$neighbourhood, L$price, main = "TITLE", xlab = "Neighbourhood", ylab = "Price")
But, I get an error:
Error in barplot.default(L$neighbourhood, L$price, main = "TITLE",
xlab = "Neighbourhood", : 'height' must be a vector or a matrix
Can we not use string data as input to the barplot function in R? How can I fix this error, please?
Quite unclear what you want to barplot. Let's assume you want to see the average price per neighborhood. If that's what you're after you can proceed like this.
First some illustrative data:
set.seed(123)
Neighborhood <- sample(LETTERS[1:4], 10, replace = T)
Price <- sample(10:100, 10, replace = T)
df <- data.frame(Neighborhood, Price)
df
Neighborhood Price
1 C 23
2 C 34
3 C 99
4 B 100
5 C 78
6 B 100
7 B 66
8 B 18
9 C 81
10 A 35
Now compute the averages by neighborhood using the function aggregate and store the result in a new dataframe:
df_new <- aggregate(x = df$Price, by = list(df$Neighborhood), function(x) mean(x))
df_new
Group.1 x
1 A 35
2 B 71
3 C 63
And finally you can plot the average prices from the x variable and add the neighborhood names from the Group.1 column:
barplot(df_new$x, names.arg = df_new$Group.1)
An even simpler solution is this, using tapply and mean:
df_new <- tapply(df$Price, df$Neighborhood, mean)
barplot(df_new, names.arg = names(df_new))
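If you already use ggplot2, the averaging and plotting can also be done in one call; a sketch, assuming ggplot2 is installed:
library(ggplot2)
# stat_summary() computes the mean price per neighborhood and draws it as bars.
ggplot(df, aes(x = Neighborhood, y = Price)) +
  stat_summary(fun = mean, geom = "bar") +
  labs(x = "Neighbourhood", y = "Average price")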

Grouped moving average with new time window in R

A minimal example would be:
set.seed(42)
df <- data.frame(ID = rep("P1", 100),
treatment = c(rep("A", 50), rep("B", 50)),
t = rep(seq(1:50), 2),
x = rnorm(n = 100))
which, let's imagine, is some value measured each second. Now I would also like the average of each 20-second window, so 1-20, 21-40, ... (i.e. no overlap).
My actual data has a column ID representing multiple subjects, a treatment column with labels for two treatments, and a time column in increments of 5 (0, 5, 10, 15, 20). There are other important grouping columns as well. The values are from EEG recordings and I need to increase the window size for some analyses.
What would be the cleanest way to do this? Note that there must be a way of knowing which time window the calculated average relates to (just 20, 40 would be enough, not a string like "1-20").
EDIT:
Based on a now-deleted comment, I got to this:
df %>%
  mutate(timeWin = ceiling(.$t/20)*20) %>%
  group_by(ID, treatment, timeWin) %>%
  summarise(xAvg = mean(x))
The only issue with that bit of code is the rather crude use of ceiling: in the example above, the last window (t = 41-50) gets labelled 60.
df$grouped_time = ave(df$t, ceiling(df$t/20), FUN = max)
aggregate(df["x"], df[c("ID", "treatment", "grouped_time")], mean)
# ID treatment grouped_time x
#1 P1 A 20 0.19192002
#2 P1 B 20 0.27873536
#3 P1 A 40 -0.27099180
#4 P1 B 40 0.01661547
#5 P1 A 50 -0.02021535
#6 P1 B 50 -0.08719458
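A dplyr sketch of the same idea (assuming dplyr is loaded): label each window by the largest time point actually observed in it, which gives 20, 40, 50 here and avoids the 60 label mentioned in the edit.
library(dplyr)
df %>%
  group_by(ID, treatment, win = ceiling(t / 20)) %>%
  summarise(grouped_time = max(t),
            xAvg = mean(x),
            .groups = "drop") %>%
  select(ID, treatment, grouped_time, xAvg)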

Calculate the trend for 2000-rows time series and isolate the abnormal rows

I have an R dataframe which describes the evolution of the sales of a product in approx. 2000 shops on a quarterly basis, with 5 columns (i.e. 5 periods of time). I'd like to know how to analyse it with R.
I've already tried some basic analysis, that is to say to determine the average sales for the 1st period, the 2nd period, etc., and then to compare the evolution of each shop relative to this general evolution. For instance, there is a total of 55 000 sales for the 1st period and 35 000 for the 5th, so I assume that for each shop the normal sales in the 5th period should be 35/55 = 0.63 times the amount of the 1st period's sales: if shop X has sold 100 items in the first period, I assume that it should normally sell 63 items in the 5th period.
Obviously, this is an easy-to-do method, but it is not statistically relevant.
I would like a method which would enable me to determine a trend curve that minimizes my residual error. My objective is to be able to analyse the sales of the shops while neutralizing the general trend: I'd like to know precisely which shops are underperforming and which are overperforming, with a statistically correct approach.
My dataframe is structured in this way :
shopID | sum | qt1 | qt2 | qt3 | qt4 | qt5
000001 | 150 | 45 | 15 | 40 | 25 | 25
000002 | 100 | 20 | 20 | 20 | 20 | 20
000003 | 500 | 200 | 0 | 100 | 100 | 100
... (2200 rows)
I've tried to put my time series in a list, which was successful, with the following function:
reversesales=t(data.frame(sales$qt1,sales$qt2,sales$qt3,sales$qt4,sales$qt5))
# I reverse rows and columns of the frame in order that the time periods be the rows
timeser<-ts(reversesales,start=1,end=5, deltat=1/4)
# deltat=1/4 because it is a quarterly basis, 1 and 5 because I have 5 quarters
Still, I am unable to do anything with this variable. I can't make any plot (with the plot function), as there are 2200 rows and so R wants to make 2200 successive plots, which is obviously not what I want.
In addition, I don't know how to determine the theoretical trend and the theoretical value of the sales for each period for each shop...
Thank you for your help! (and merry Christmas)
An implementation of a mixed model:
install.packages("nlme")
library("nlme")
library(dplyr)
# Generating some data with a structure like yours:
start <- round(sample(10:100, 50, replace = TRUE)*runif(50))
df <- data_frame(shopID = 1:50, qt1 = start, qt2 =round(qt1*runif(50, .5, 2)) ,qt3 = round(qt2*runif(50, .5, 2)), qt4 = round(qt3*runif(50, .5, 2)), qt5 = round(qt4*runif(50, .5, 2)))
df <- as.data.frame(df)
# Converting in into the long format:
df <- reshape(df, idvar = "shopID", varying = names(df)[-1], direction = "long", sep = "")
Estimating the model:
mod <- lme(qt ~ time, random = ~ time | shopID, data = df)
# Extract the random effects for comparison:
random.effects(mod)
(Intercept) time
1 74.0790805 3.7034172
2 7.8713699 4.2138001
3 -8.0670810 -5.8754060
4 -16.5114428 16.4920663
5 -16.7098229 6.4685228
6 -11.9630688 -8.0411504
7 -12.9669777 21.3071366
8 -24.1099280 32.9274361
9 8.5107335 -9.7976905
10 -13.2707679 -6.6028927
11 3.6206163 -4.1017784
12 21.2342886 -6.7120725
13 -14.6489512 11.6847109
14 -14.7291647 2.1365768
15 10.6791941 3.2097199
16 -14.1524187 -1.6933291
17 5.2120647 8.0119320
18 -2.5172933 -6.5011416
19 -9.0094366 -5.6031271
20 1.4857512 -5.9913865
21 -16.5973442 3.5164298
22 -26.7724763 27.9264081
23 49.0764631 -12.9800871
24 -0.1512509 2.3589947
25 15.7723150 -7.9295698
26 2.1955489 11.0318875
27 -8.0890346 -5.4145977
28 0.1338790 -8.3551182
29 9.7113758 -9.5799588
30 -6.0257683 42.3140432
31 -15.7655545 -8.6226255
32 -4.1450984 18.7995079
33 4.1510104 -1.6384103
34 2.5107652 -2.0871890
35 -23.8640815 7.6680185
36 -10.8228653 -7.7370976
37 -14.1253093 -8.1738468
38 42.4114024 -9.0436585
39 -10.7453627 2.4590883
40 -12.0947901 -5.2763010
41 -7.6578305 -7.9630013
42 -14.9985612 -0.4848326
43 -13.4081771 -7.2655456
44 -11.5646620 -7.5365387
45 6.9116844 -10.5200339
46 70.7785492 -11.5522014
47 -7.3556367 -8.3946072
48 27.3830419 -6.9049164
49 14.3188079 -9.9334156
50 -15.2077850 -7.9161690
I would interpret the values as follows: consider them as deviations from zero, so that positive values are positive deviations from the average, whereas negative values are negative deviations from the average. The averages of the two columns are zero, as is checked below:
round(apply(random.effects(mod), 2, mean))
(Intercept) time
0 0
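If the goal is to flag under- and over-performing shops, one possible next step (a sketch, assuming the model object mod from above) is to sort the shops by their estimated time slope:
re <- random.effects(mod)
# Most negative slopes deviate downwards from the overall trend (underperformers);
# most positive slopes deviate upwards (overperformers).
head(re[order(re$time), ])
head(re[order(re$time, decreasing = TRUE), ])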
library(zoo)
#Reconstructing the data with four quarter columns (instead of five quarters as in your example)
shopID <- c(1, 2, 3, 4, 5)
sum <- c(150, 100, 500, 350, 50)
qt1 <- c(40, 10, 130, 50, 10)
qt2 <- c(40, 40, 110, 100, 15)
qt3 <- c(50, 30, 140, 150, 10)
qt4 <- c(20, 20, 120, 50, 15)
myDF <- data.frame(shopID, sum, qt1, qt2, qt3, qt4)
#The ts() function converts a numeric vector into an R time series object
ts1 <- ts(as.numeric((myDF[1,3:6])), frequency=4)
ts2 <- ts(as.numeric((myDF[2,3:6])), frequency=4)
ts3 <- ts(as.numeric((myDF[3,3:6])), frequency=4)
ts4 <- ts(as.numeric((myDF[4,3:6])), frequency=4)
ts5 <- ts(as.numeric((myDF[5,3:6])), frequency=4)
#Merge time series objects
tsm <- merge(a = as.zoo(ts1), b = as.zoo(ts2), c = as.zoo(ts3), d = as.zoo(ts4), e = as.zoo(ts5))
#Plotting the Time Series
plot.ts(tsm, plot.type = "single", lty = 1:5, xlab = "Time", ylab = "Sales")
The code is not optimized and can be improved. Hope this gives some direction.

Using R to remove data which is below a quartile threshold

I am creating correlations using R, with the following code:
Values<-read.csv(inputFile, header = TRUE)
O<-Values$Abundance_O
S<-Values$Abundance_S
cor(O,S)
pear_cor<-round(cor(O,S),4)
outfile<-paste(inputFile, ".jpg", sep = "")
jpeg(filename = outfile, width = 15, height = 10, units = "in", pointsize = 10, quality = 75, bg = "white", res = 300, restoreConsole = TRUE)
rx<-range(0,20000000)
ry<-range(0,200000)
plot(rx,ry, ylab="S", xlab="O", main="O vs S", type="n")
points(O,S, col="black", pch=3, lwd=1)
mtext(sprintf("%s %.4f", "pearson: ", pear_cor), adj=1, padj=0, side = 1, line = 4)
dev.off()
pear_cor
I now need to find the lower quartile for each set of data and exclude the data that falls within the lower quartile. I would then like to rewrite the data without those values and use the new column of data in the correlation analysis (because I want to threshold the data by the lower quartile). If there is a way to write this so that it is easy to change the threshold by passing arguments from Java (as I do with the input file name), that would be even better!
Thank you so much.
I have now implemented the answer below and it is working; however, I need to keep the pairs of data together for the correlation. Here is an example of my data (from the csv):
Abundance_O Abundance_S
3635900.752 1390.883073
463299.4622 1470.92626
359101.0482 989.1609251
284966.6421 3248.832403
415283.663 2492.231265
2076456.856 10175.48946
620286.6206 5074.268802
3709754.717 269.6856808
803321.0892 118.2935093
411553.0203 4772.499758
50626.83554 17.29893001
337428.8939 203.3536852
42046.61549 152.1321255
1372013.047 5436.783169
939106.3275 7080.770535
96618.01393 1967.834701
229045.6983 948.3087208
4419414.018 23735.19352
So I need to exclude both values in a row if one of them does not meet my quartile threshold (the 0.25 quantile). For example, if the quartile for O were 45000, then the row "42046.61549, 152.1321255" would be removed. Is this possible? If I read in both columns as a data frame, can I search each column separately? Or find the quartiles and then feed those values into code that removes the appropriate rows?
Thanks again, and sorry for the evolution of the question!
Please try to provide a reproducible example, but if you have data in a data.frame, you can subset it using the quantile function as the logical test. For instance, in the following data we want to select only rows from the dataframe where the value of the measured variable 'Val' is above the bottom quartile:
# set.seed so you can reproduce these values exactly on your system
set.seed(39856)
df <- data.frame( ID = 1:10 , Val = runif(10) )
df
ID Val
1 1 0.76487516
2 2 0.59755578
3 3 0.94584374
4 4 0.72179297
5 5 0.04513418
6 6 0.95772248
7 7 0.14566118
8 8 0.84898704
9 9 0.07246594
10 10 0.14136138
# Now to select only rows where the value of our measured variable 'Val' is above the bottom 25% quartile
df[ df$Val > quantile(df$Val , 0.25 ) , ]
ID Val
1 1 0.7648752
2 2 0.5975558
3 3 0.9458437
4 4 0.7217930
6 6 0.9577225
7 7 0.1456612
8 8 0.8489870
# And check the value of the bottom 25% quantile...
quantile(df$Val , 0.25 )
25%
0.1424363
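To keep the pairs together, as asked in the update: if both columns are read into one data.frame, filter whole rows so that a row survives only when both values clear their own lower quartile. A sketch using the column names from the question; the cutoff could equally be passed in as a command-line argument, like the input file name:
Values <- read.csv(inputFile, header = TRUE)

# The threshold could also come from a command-line argument, e.g.
# args <- commandArgs(trailingOnly = TRUE); threshold <- as.numeric(args[2])
threshold <- 0.25

# Keep a row only if BOTH columns are above their own lower quartile.
keep <- Values$Abundance_O > quantile(Values$Abundance_O, threshold) &
        Values$Abundance_S > quantile(Values$Abundance_S, threshold)
Values_filtered <- Values[keep, ]

cor(Values_filtered$Abundance_O, Values_filtered$Abundance_S)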
Although this is an old question, I came across it during research of my own and I arrived at a solution that someone may be interested in.
I first defined a function which will convert a numerical vector into its quantile groups. The parameter n determines the number of quantile groups (n = 4 for quartiles, n = 10 for deciles).
qgroup = function(numvec, n = 4){
  qtile = quantile(numvec, probs = seq(0, 1, 1/n))
  out = sapply(numvec, function(x) sum(x >= qtile[-(n+1)]))
  return(out)
}
Function example:
v = rep(1:20)
> qgroup(v)
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4
Consider now the following data:
library(data.table)

dt = data.table(
  A0 = runif(100),
  A1 = runif(100)
)
We apply qgroup() across the data to obtain two quartile group columns:
cols = colnames(dt)
qcols = c('Q0', 'Q1')
dt[, (qcols) := lapply(.SD, qgroup), .SDcols = cols]
head(dt)
           A0        A1 Q0 Q1
1: 0.72121846 0.1908863  3  1
2: 0.70373594 0.4389152  3  2
3: 0.04604934 0.5301261  1  3
4: 0.10476643 0.1108709  1  1
5: 0.76907762 0.4913463  4  2
6: 0.38265848 0.9291649  2  4
Lastly, we keep only the rows for which both quartile groups are above the first quartile:
dt = dt[Q0 > 1 & Q1 > 1]
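If the group columns are not needed afterwards, the same filter can be written directly against the quantiles; a one-line sketch, assuming data.table is loaded:
# Keep only the rows where both A0 and A1 are above their own lower quartiles.
dt_filtered <- dt[A0 > quantile(A0, 0.25) & A1 > quantile(A1, 0.25)]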
