Random number generation but common within group - r

Is there a way to generate random numbers from a distribution such that the numbers are identical across rows within a group? In an unbalanced panel, I have a household_id variable, and for each household I want to draw a single random number from a truncated normal distribution using rtruncnorm.
Thank you.

Household_id  Random number
1             0.6
1             0.6
1             0.6
2             0.1
3             0.9
3             0.9
4             0.2
5             0.7
6             0.3
6             0.3
So, household_id identifies the household in this unbalanced panel. I want to generate random numbers with rtruncnorm such that, as shown above, the value is the same for all cells within a household.
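One way to do this (a sketch, not a tested answer): draw one value per household and recycle it onto every row with match(). The truncation bounds and moments below are purely illustrative; substitute your own rtruncnorm parameters.

```r
set.seed(123)  # for reproducibility
df <- data.frame(household_id = c(1, 1, 1, 2, 3, 3, 4, 5, 6, 6))

ids <- unique(df$household_id)
# one draw per household; with the truncnorm package this would be, e.g.
#   draws <- truncnorm::rtruncnorm(length(ids), a = 0, b = 1, mean = 0.5, sd = 0.25)
# runif() stands in here only so the sketch runs without extra packages
draws <- runif(length(ids))

# recycle each household's single draw to all of its rows
df$rand <- draws[match(df$household_id, ids)]
```

The key point is that the generator is called once per household, not once per row, so every row of a household necessarily carries the same value.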

Related

How to reference multiple dataframe columns to calculate a new column of weighted averages in R

I am currently calculating the weighted-average column for my dataframe by referencing each column name manually. Is there a way to shorten the code by multiplying sets of columns, e.g.
df[,c("A","B","C")] and df[,c("PerA","PerB","PerC")], to obtain the weighted average, like SUMPRODUCT in Excel? Especially if I have many input columns contributing to the weighted-average column.
df$WtAvg = df$A*df$PerA + df$B*df$PerB + df$C*df$PerC
Without transforming your dataframe, and assuming the first half of the columns holds the sizes and the second half the weights, you can use the weighted.mean function inside apply:
df$WtAvg = apply(df,1,function(x){weighted.mean(x[1:(ncol(df)/2)],
x[(ncol(df)/2+1):ncol(df)])})
And you get the following output:
> df
A B C PerA PerB PerC WtAvg
1 1 2 3 0.1 0.2 0.7 2.6
2 4 5 6 0.5 0.3 0.2 4.7
3 7 8 9 0.6 0.1 0.3 7.7
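The SUMPRODUCT idea from the question can also be written without apply, as the element-wise product of the two column blocks summed by row. A sketch under the same assumption (sizes first, weights second), using the data implied by the output above; since each row's weights sum to 1, rowSums of the products equals the weighted mean:

```r
df <- data.frame(A = c(1, 4, 7), B = c(2, 5, 8), C = c(3, 6, 9),
                 PerA = c(0.1, 0.5, 0.6), PerB = c(0.2, 0.3, 0.1),
                 PerC = c(0.7, 0.2, 0.3))

half <- ncol(df) / 2
# element-wise product of the value block and the weight block, summed by row
df$WtAvg <- rowSums(df[, 1:half] * df[, (half + 1):(2 * half)])
```

Being fully vectorized, this avoids the row-by-row function calls of apply, which matters on large dataframes.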

How to calculate the average of experimental values of replicates of the same sample, without knowing the number of replicates ahead?

I have a csv file with a data set of experimental values of many samples, and sometimes replicates of the same sample. For the replicates I only take into account the mean value of the replicates belonging to the same sample. The problem is, the number of replicates varies, it can be 2, 3, 4 etc...
My code isn't right: it would only work if the number of replicates is 2, since I use a loop to compare each sampleID to the previous sampleID. And it doesn't work even then; it adds the same average value to all my samples, which is not right. I think there is also a problem at the start of the loop: when x=1, x-1=0, which doesn't index any value, so that may break the code.
I am a beginner in R; I never had any courses or training and am trying to learn it by myself, so thank you in advance for your help.
My dataset:
Expected output:
PS: in this example the number of replicates is 2. However, it can differ between samples: sometimes it's 2, sometimes 3, 4, etc.
for (x in length(dat$Sample)){
  if (dat$Sample[x]==dat$Sample[x-1]){
    dat$Average.OD[x-1] <- mean(dat$OD[x], dat$OD[x-1])
    dat$Average.OD[x] <- NA
  }
}
Let me show you a possible solution using data.table.
#Data
data <- data.frame('Sample'=c('Blank','Blank','STD1','STD1'),
'OD'=c(0.07,0.08,0.09,0.10))
#Code
#Converting our data to data.table.
library(data.table)
setDT(data)
#Finding the average of OD by the Sample column (Sample is the key). If you want it by both Sample and Replicates, pass both of them in `by`, and so on.
data[, AverageOD := mean(OD), by = c("Sample")]
#Turning all the duplicate AverageOD values to NA.
data[duplicated(data, by = c("Sample")), AverageOD := NA]
#Turning column name of AverageOD to Average OD
names(data)[which(names(data) == "AverageOD")] = 'Average OD'
Let me know if you have any questions.
You can do this without any looping using aggregate and merge. Since you do not provide any data, I illustrate with a simple example.
## Example data
set.seed(123)
Sample = round(runif(10), 1)
OD = sample(4, 10, replace=T)
dat = data.frame(OD, Sample)
Means = aggregate(dat$Sample, list(dat$OD), mean, na.rm=T)
names(Means) = c("OD", "mean")
Means
OD mean
1 1 0.9000000
2 2 0.7000000
3 3 0.3666667
4 4 0.4000000
merge(dat, Means, "OD")
OD Sample mean
1 1 0.9 0.9000000
2 1 0.9 0.9000000
3 2 0.8 0.7000000
4 2 0.9 0.7000000
5 2 0.4 0.7000000
6 3 0.0 0.3666667
7 3 0.6 0.3666667
8 3 0.5 0.3666667
9 4 0.3 0.4000000
10 4 0.5 0.4000000
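For completeness, the same replicate-averaging can be done in base R on the Blank/STD1 data from the first answer, using ave(), which broadcasts each group's mean back onto every row of the group:

```r
dat <- data.frame(Sample = c("Blank", "Blank", "STD1", "STD1"),
                  OD = c(0.07, 0.08, 0.09, 0.10))

# ave()'s default FUN is mean, so this fills every row with its sample's mean OD
dat$Average.OD <- ave(dat$OD, dat$Sample)
# keep the mean on the first replicate of each sample only, as in the expected output
dat$Average.OD[duplicated(dat$Sample)] <- NA
```

Like the data.table and aggregate approaches, this works for any number of replicates per sample, since the grouping is by Sample rather than by position.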

R. Using t-test, compare individual mean with global mean

I have a huge matrix of this form, with 1000000 rows and 10000 columns. This is a toy example:
A B C Mean
1 3 4 2.66
2 4 3 3
1 3 4 2.66
9 9 9 9
1 3 2 2
2 4 5 3
1 2 6 3
2 3 5 3.33
The rows in column "Mean" represent the mean of A, B and C for each row. The global mean of column "Mean" is 3.58. I would like to know, using a t-test in R, whether the mean in each row is significantly higher than the global mean. How can I get the p-values for this comparison? Comparing means between 2 groups is very simple using t.test(), but I cannot find how to compare a single value with the mean of a group that includes that value.
I strongly agree with Roman that you should go back to CV, since this approach seems liable to give you a number of false positives.
But in terms of your R question, you could try a one-sample t-test here:
global.mean <- 3.58
val.matrix <- matrix(c(...),...)
pvals <- apply(val.matrix,1,function(r) t.test(r,mu=global.mean)$p.value)
### should do a multiple comparison correction here, e.g., pvals*nrow(val.matrix)
This will give you a vector of size nrow(val.matrix) with each element being the p-value from the two-sided t-test testing whether the values of a row are
significantly different from 3.58. I'm not advocating for this statistical approach, but this is how you could implement it.
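Filling in the elided matrix with the toy data from the question, a self-contained version might look like the following. Note one practical wrinkle: a constant row such as 9 9 9 has zero variance, so t.test() errors with "data are essentially constant"; the sketch catches that and returns NA for such rows.

```r
# toy matrix from the question, columns A, B, C
m <- matrix(c(1, 3, 4,
              2, 4, 3,
              1, 3, 4,
              9, 9, 9,
              1, 3, 2,
              2, 4, 5,
              1, 2, 6,
              2, 3, 5), ncol = 3, byrow = TRUE)

global.mean <- 3.58
# one-sample two-sided t-test per row; constant rows yield NA instead of an error
pvals <- apply(m, 1, function(r)
  tryCatch(t.test(r, mu = global.mean)$p.value,
           error = function(e) NA_real_))
```

For a one-sided test of "higher than the global mean", add alternative = "greater" to the t.test() call; the multiple-comparison caveat from the answer above still applies.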

R - conditional cumsum using multiple columns

I'm new to Stack Overflow, so I hope I post my question in the right format. I have a test dataset with three columns, where rank is the ranking of a cell, Esvalue is the value of a cell and zoneID is an area identifier (note: in the real dataset I have up to 40,000 zoneIDs):
rank<-seq(0.1,1,0.1)
Esvalue<-seq(10,1)
zoneID<-rep(seq.int(1,2),times=5)
rank Esvalue zoneID
0.1 10 1
0.2 9 2
0.3 8 1
0.4 7 2
0.5 6 1
0.6 5 2
0.7 4 1
0.8 3 2
0.9 2 1
1.0 1 2
I want to calculate the following:
% ES value <- For each rank, including all lower ranks, the cumulative % share of the total ES value relative to the ES value of all zones
cumsum(df$Esvalue)/sum(df$Esvalue)
% ES value zone <- For each rank, including all lower ranks, the cumulative % share of the zone's Esvalue relative to the total Esvalue of that zoneID, for each zone. I tried this using mutate from dplyr; so far it only gives me the cumulative sum, not the share. In the end this will generate a variable for each zoneID:
df %>%
mutate(cA=cumsum(ifelse(!is.na(zoneID) & zoneID==1,Esvalue,0))) %>%
mutate(cB=cumsum(ifelse(!is.na(zoneID) & zoneID==2,Esvalue,0)))
These two variables I want to combine by
1) calculating the abs difference between the two for all the zoneIDs
2) for each rank calculate the mean of the absolute difference over all zoneIDs
In the end the final output should look like:
rank Esvalue zoneID mean_abs_diff
0.1 10 1 0.16666667
0.2 9 2 0.01333333
0.3 8 1 0.12000000
0.4 7 2 0.02000000
0.5 6 1 0.08000000
0.6 5 2 0.02000000
0.7 4 1 0.04666667
0.8 3 2 0.01333333
0.9 2 1 0.02000000
1.0 1 2 0.00000000
Now I created the last column using some intermediate steps in Excel, but my final dataset will be far too big to be handled by Excel. Any advice on how to proceed would be appreciated.
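One reading of steps 1) and 2) that reproduces the expected output: track each zone's cumulative share of its own total ES value at every rank (not only at the zone's own rows), take the absolute difference from the overall cumulative share, and average across zones at each rank. A base-R sketch under that interpretation:

```r
df <- data.frame(rank = seq(0.1, 1, 0.1),
                 Esvalue = 10:1,
                 zoneID = rep(1:2, times = 5))
df <- df[order(df$rank), ]  # cumulative sums assume rank order

zones <- sort(unique(df$zoneID))
# matrix of cumulative shares: one row per rank, one column per zone;
# zeroing out other zones' rows lets cumsum carry each zone's share forward
zone_shares <- sapply(zones, function(z) {
  v <- ifelse(df$zoneID == z, df$Esvalue, 0)
  cumsum(v) / sum(v)
})
overall_share <- cumsum(df$Esvalue) / sum(df$Esvalue)

df$mean_abs_diff <- rowMeans(abs(zone_shares - overall_share))
```

This matches the mean_abs_diff column shown above (0.16666667 at rank 0.1, down to 0 at rank 1.0). With 40,000 zoneIDs the sapply builds a ranks-by-zones matrix, so it may need to be processed in chunks of zones if memory becomes a constraint.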

Is there a package that I can use in order to get rules for a target outcome in R

For example, in the data set below, I would like to get the ranges of each variable that yield a pre-set value of "percentage". Say I need the value of "percentage" to be >= 0.7; in that case the outcome should be something like:
birds >= 5, 1 < wolfs <= 3, 2 <= snakes <= 4
Example data set:
dat <- read.table(text = "birds wolfs snakes percentage
3 8 7 0.50
1 2 3 0.33
5 1 1 0.66
6 3 2 0.80
5 2 4 0.74", header = TRUE)
I can't use decision trees, as I have a large data frame and can't view the whole tree properly. I tried the *arules* package, but it requires all variables to be factors, while I have a mixed dataset of factor, logical and continuous variables and would like to keep the continuous variables as they are. Also, "percentage" is the only variable I want to optimize.
The code that I wrote with the *arules* package is this:
library(arules)
dat$birds<-as.factor(dat$birds)
dat$wolfs<-as.factor(dat$wolfs)
dat$snakes<-as.factor(dat$snakes)
dat$percentage<-as.factor(dat$percentage)
rules<-apriori(dat, parameter = list(minlen=2, supp=0.005, conf=0.8))
Thank you
I may have misunderstood the question, but to get the maximum value of each variable under the restriction percentage >= 0.7 you could do this:
lapply(dat[dat$percentage >= 0.7, 1:3], max)
$birds
[1] 6
$wolfs
[1] 3
$snakes
[1] 4
Edit after comment:
So perhaps this is more what you are looking for:
> as.data.frame(lapply(dat[dat$percentage >= 0.7,1:3], function(y) c(min(y), max(y))))
birds wolfs snakes
1 5 2 2
2 6 3 4
It will give the min and max values representing the ranges of variables if percentage >=0.7
If this is completely missing what you are trying to achieve, I may not be the right person to help you.
Edit #2:
> as.data.frame(lapply(dat[dat$percentage >= 0.7,1:3], function(y) c(min(y), max(y), length(y), length(y)/nrow(dat))))
birds wolfs snakes
1 5.0 2.0 2.0
2 6.0 3.0 4.0
3 2.0 2.0 2.0
4 0.4 0.4 0.4
Row 1: min
Row 2: max
Row 3: number of observations meeting the condition
Row 4: percentage of observations meeting the condition (relative to total observations)
