R data frame, sampling with replacement while controlling for two variables

I have the following data frame in R, with three variables:
id<-c(1,2,3,4,5,6,7,8,9,10)
frequency<-c(1,2,3,4,5,6,7,8,9,10)
male<-c(1,0,1,0,1,0,1,0,1,0)
df<-data.frame(id,frequency,male)
For df, the mean frequency is 5.5 and 50% of observations are male. Now I want to take a random sample from df, with replacement and of the same size, such that the mean frequency of the new sample is 4 while the proportion of males stays the same.
I wonder if there is any way to do such a thing in R.
Thanks in advance.

I cannot find a single built-in function for what you want, but the following gives the result. The combination of repeat and an if condition plays the same role as a while loop, and the other line draws a sample of size 4.
repeat {
  df.sample = df[sample(nrow(df), size = 4, replace = FALSE), ]
  if(mean(df.sample$frequency) == 4.5 & mean(df.sample$male) == 0.5){
    break
  }
}
The result is
> df.sample
id frequency male
4 4 4 0
2 2 2 0
9 9 9 1
3 3 3 1
With a while loop (note that df.sample must already exist before the condition is first evaluated, so draw one sample up front):
df.sample = df[sample(nrow(df), size = 4, replace = FALSE), ]
while(!(mean(df.sample$frequency) == 4.5 & mean(df.sample$male) == 0.5)){
  df.sample = df[sample(nrow(df), size = 4, replace = FALSE), ]
}
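Taking the question literally (a sample of the same size as df, drawn with replacement, with a target mean frequency of 4 and the male proportion kept at 0.5), the same rejection idea can be adapted. One caveat: in this particular df, male equals 1 exactly when frequency is odd, so a sample with exactly five males always has an odd frequency sum and can never hit a mean of exactly 4; the sketch below therefore accepts a mean within a small tolerance. The tolerance and the seed are my own assumptions, not part of the question.
set.seed(1)   # only for reproducibility
repeat {
  # resample all 10 rows with replacement
  df.sample <- df[sample(nrow(df), size = nrow(df), replace = TRUE), ]
  if (mean(df.sample$male) == 0.5 &&
      abs(mean(df.sample$frequency) - 4) <= 0.1) {
    break
  }
}
mean(df.sample$frequency)   # close to 4
mean(df.sample$male)        # exactly 0.5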

Adding specific column value according to row value

I have a dataframe h3
Genotype Preference
Rice 1
Rice 2
Lr 3
Lr 3
th 4
th 7
I want the dataframe to look like
Genotype Preference Haplotype
Rice 1 1
Rice 2 1
Lr 3 2
Lr 3 2
th 4 0.5
th 7 0.5
That is, I want to add a numerical value corresponding to each type of genotype. I have around 100 observations per genotype. I want to add this numerical variable as a new column in a single line of code, ensuring that 1 corresponds to Rice, 2 to Lr and 0.5 to th.
I tried constructing the code with mutate/ifelse:
h3 %>% select(Genotype) %>% mutate(Type = ifelse (Genotype = c("Rice"), 1, Genotype))
Other results which I looked up provide solutions for adding a column with a value calculated from previous columns, but not with specific values.
I have found this dplyr mutate with conditional values and Apply R-function to rows depending on value in other column, but I don't know how to modify them for my case.
Any help with this will be greatly appreciated.
Using dplyr, you can do:
library(dplyr)
df %>% mutate(Haplotype = ifelse(Genotype == "Rice",1,ifelse(Genotype == "Lr",2,0.5)))
Using base R, you can do the same thing:
df$Haplotype = ifelse(df$Genotype == "Rice",1,ifelse(df$Genotype == "Lr",2,0.5))
Data
df = data.frame("Genotype" = c("Rice","Rice","Lr","Lr","Th","Th"),
"Preference" = c(1,2,3,3,4,7))

Proportion across rows and specific columns in R

I want to get a proportion of pcr detection and create a new column using the following test dataset. I want a proportion of detection for each row, only in the columns pcr1 to pcr6. I want the other columns to be ignored.
site sample pcr1 pcr2 pcr3 pcr4 pcr5 pcr6
pond 1 1 1 1 1 0 1 1
pond 1 2 1 1 0 0 1 1
pond 1 3 0 0 1 1 1 1
I want the output to be a new column with the detection proportion. The dataset above is only a small sample of the one I am using. I've tried:
data$detection.proportion <- rowMeans(subset(testdf, select = c(pcr1, pcr2, pcr3, pcr4, pcr5, pcr6)), na.rm = TRUE)
This works for this small dataset, but when I tried it on my larger one it gave incorrect proportions. What I'm looking for is a way to count all the 1s from pcr1 to pcr6 and divide them by the total number of 1s and 0s (which I know is 6, but I would like R to work this out in case it changes).
I found a way to do it, in case anyone else needs it. I don't know if this is the most efficient, but it worked for me.
data$detection.proportion <- length(subset(testdf, select = c(pcr1, pcr2, pcr3, pcr4, pcr5, pcr6)))
#Calculates the length of pcrs = 6
p.detection <- rowSums(data[,c(-1, -2, -3, -10)] == "1")
#Counts the occurrences of 1 (which is detection) for each row
data$detection.proportion <- p.detection/data$detection.proportion
#Divides the occurrences by the total number of pcrs to find the detected proportion.
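If the pcr columns can be picked out by name, a shorter version of the same idea is possible. A hedged sketch (my own variant, assuming the pcr columns are the only ones whose names start with "pcr" and that detections are coded 0/1):
pcr_cols <- grep("^pcr", names(testdf), value = TRUE)    # find the pcr columns by name
testdf$detection.proportion <- rowMeans(testdf[pcr_cols] == 1, na.rm = TRUE)
rowMeans() divides by the number of pcr columns automatically, so this keeps working if more pcr columns are added later.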

Complex data calculation for consecutive zeros at row level in R (lag v/s lead)

I have a complex calculation that needs to be done. It is basically at a row level, and I am not sure how to tackle it.
If you can help me with the approach or any functions, that would be really great.
I will break my problem into two sub-problems for simplicity.
Below is what my data looks like:
Group,Date,Month,Sales,lag7,lag6,lag5,lag4,lag3,lag2,lag1,lag0(reference),lead1,lead2,lead3,lead4,lead5,lead6,lead7
Group1,42005,1,2503,1,1,0,0,0,0,0,0,0,0,0,0,1,0,1
Group1,42036,2,3734,1,1,1,1,1,0,0,0,0,1,1,0,0,0,0
Group1,42064,3,6631,1,0,0,1,0,0,0,0,0,0,1,1,1,1,0
Group1,42095,4,8606,0,1,0,1,1,0,1,0,1,1,1,0,0,0,0
Group1,42125,5,1889,0,1,1,0,1,0,0,0,0,0,0,0,1,1,0
Group1,42156,6,4819,0,1,0,0,0,1,0,0,1,0,1,1,1,1,0
Group1,42186,7,5120,0,0,1,1,1,1,1,0,0,1,1,0,1,1,0
I have data for each Group at a monthly level.
I would like to capture the two things below.
1. The count of consecutive zeros for each row, extending in both directions from lag0(reference)
The relevant cases are the zeros that run consecutively outward from lag0(reference) until the first 1 is reached. I want to capture that count of zeros at row level, along with the corresponding Sales value.
Below is the output I am looking for in part 1.
Output:
Month,Sales,Count
1,2503,9
2,3734,3
3,6631,5
4,8606,0
5,1889,6
6,4819,1
7,5120,1
2. Identify consecutive rows (rows 1, 2 and 3, and similarly rows 5 and 6) where the zero runs around lag0(reference) overlap in any lag or lead position, and capture their Sales and Month values.
For example, for rows 1, 2 and 3 the overlap happens at least at lag3, lag2, lag1 and lead1, lead2; this needs to be captured and tagged as Case 1 (or 1). Similarly, for rows 5 and 6 at least lag1 overlaps, so this needs to be captured and tagged as Case 2 (or 2), along with the Sales and Month values.
Row 7 does not overlap with the previous or the following row, so it will not be captured.
Below is the result I am looking for in part 2.
Month,Sales,Case
1,2503,1
2,3734,1
3,6631,1
5,1889,2
6,4819,2
I want to run this for multiple groups, so I will either incorporate dplyr or a loop to get the result. Currently I am simply looking for the approach.
I am not sure how to solve this problem; it is the first time I am trying to capture things at row level in R. I am not looking for a complete solution, simply a first step to attack this problem. I would appreciate any leads.
An option using rle for the first part of the calculation:
# For every row, run-length encode the lag block and the lead block, then add
# up the zero run that touches lag0(reference) on each side.
df$count <- apply(df[, -c(1:4)], 1, function(x){
  first <- rle(x[1:7])     # runs within lag7..lag1
  second <- rle(x[9:15])   # runs within lead1..lead7
  count <- 0
  if(first$values[length(first$values)] == 0){
    count <- first$lengths[length(first$values)]   # zeros directly before lag0
  }
  if(second$values[1] == 0){
    count <- count + second$lengths[1]             # zeros directly after lag0
  }
  count
})
df[,c("Month", "Sales", "count")]
# Month Sales count
# 1 1 2503 9
# 2 2 3734 3
# 3 3 6631 5
# 4 4 8606 0
# 5 5 1889 6
# 6 6 4819 1
# 7 7 5120 1
Data:
df <- read.table(text =
"Group,Date,Month,Sales,lag7,lag6,lag5,lag4,lag3,lag2,lag1,lag0(reference),lead1,lead2,lead3,lead4,lead5,lead6,lead7
Group1,42005,1,2503,1,1,0,0,0,0,0,0,0,0,0,0,1,0,1
Group1,42036,2,3734,1,1,1,1,1,0,0,0,0,1,1,0,0,0,0
Group1,42064,3,6631,1,0,0,1,0,0,0,0,0,0,1,1,1,1,0
Group1,42095,4,8606,0,1,0,1,1,0,1,0,1,1,1,0,0,0,0
Group1,42125,5,1889,0,1,1,0,1,0,0,0,0,0,0,0,1,1,0
Group1,42156,6,4819,0,1,0,0,0,1,0,0,1,0,1,1,1,1,0
Group1,42186,7,5120,0,0,1,1,1,1,1,0,0,1,1,0,1,1,0",
header = TRUE, stringsAsFactors = FALSE, sep = ",")
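Regarding running this for multiple groups: because the calculation only looks within each row, the same apply() call works unchanged on a data frame holding several groups. If the groups should end up as separate data frames, a hedged sketch (my own wrapper around the logic above, assuming the lag/lead columns sit in the same positions in every group) could be:
# split by Group, compute the per-row count within each piece, then recombine
counts_by_group <- lapply(split(df, df$Group), function(g){
  g$count <- apply(g[, -c(1:4)], 1, function(x){
    first <- rle(x[1:7]); second <- rle(x[9:15])
    n <- 0
    if(first$values[length(first$values)] == 0) n <- first$lengths[length(first$values)]
    if(second$values[1] == 0) n <- n + second$lengths[1]
    n
  })
  g
})
df_all <- do.call(rbind, counts_by_group)   # one data frame again, now with a count column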

Split data frame into ntiles based on a value equal to the sum of rows divided by the number of ntiles we want

I have a data frame with about 45k points and 3 columns: weight, persons and population. Population is weight*persons. I want to be able to split the data frame into ntiles (deciles, centiles, etc.) as needed. The data frame has to be split so that each ntile contains the same amount of population.
That means the data frame needs to be split at value = sum(population)/ntile. So, for example, if ntile = 10, then sum(population)/10 = a. Next I need to add up the row values in the population column until the sum reaches a, split at that point, and continue until I have run through all 45k points. A sample of the data is below.
weight persons population
1 3687.926 9 33191.337
2 3687.926 16 59006.8217
3 3687.926 7 25815.4847
4 4420.088 5 22100.447
5 4420.088 7 30940.6167
6 4420.088 6 26520.5287
7 3687.926 15 55318.8927
8 3687.926 9 33191.3357
9 3687.926 6 22127.5577
10 4452.829 8 35622.6367
11 4452.829 3 13358.4887
12 4452.829 4 17811.3187
I have been trying to use loops. I am stuck on splitting the data frame into the n splits needed. I am new to R, so any help is appreciated.
x= df$population
break_point = sum(x)/10
ntile_points = 0
for(i in 1:length(x))
{
while(ntile_points != break_point)
{
ntile_points = ntile_points+x[i]
}
}
I'm not sure that's exactly what you want. Note that your break point is not necessarily an integer, so you have to take the rows that fall between successive break points:
ntile <- 10
df <- cbind(df, cumsum(df$population))
names(df)[ncol(df)] <- 'Cumsum'                              # running total of population
s <- seq(0, sum(df$population), sum(df$population)/ntile)    # break points
subdfs <- list()
for (i in 2:length(s)){
  # rows whose cumulative population lies in ]s[i-1], s[i]]
  subdfs <- c(subdfs, list(df[intersect(which(df$Cumsum <= s[i]), which(df$Cumsum > s[i-1])), ]))
}
Then subdfs is a list containing 10 data frames split as you wanted. Access the first data frame with subdfs[[1]] and so on. Maybe I did not understand what you want; tell me if so.
This way the first data frame contains all the rows whose cumulative population falls in the interval ]0, sum(population)/10], the second contains the following rows, whose cumulative population falls in ]sum(population)/10, 2*sum(population)/10], and so on.
Is that what you wanted?
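For reference, the same binning can be written more compactly with cut() and split(). This is a hedged sketch of an alternative, not part of the answer above; as above, it assumes the rows are already in the order in which they should be accumulated:
ntile <- 10
# bin each row by the running total of population, into ntile equal-width bins
breaks <- seq(0, sum(df$population), length.out = ntile + 1)
grp <- cut(cumsum(df$population), breaks = breaks, labels = FALSE, include.lowest = TRUE)
subdfs <- split(df, grp)   # list of up to ntile data frames, as before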

Summary in R for frequency tables?

I have a set of user recommendations
review=matrix(c(5:1,10,2,1,1,2), nrow=5, ncol=2, dimnames=list(NULL,c("Star","Votes")))
and wanted to use summary(review) to show the basic properties: mean, median, quartiles, and min/max.
But it gives back the summary of both columns separately. I refrain from using a data.frame because the 'Star' factors are ordered.
How can I tell R that Star is an ordered numeric score and Votes are the corresponding frequencies?
I'm not exactly sure what you mean by taking the mean in general if Star is supposed to be an ordered factor. However, in the example you give where Star is actually a set of numeric values, you can use the following:
library(Hmisc)
R> review=matrix(c(5:1,10,2,1,1,2), nrow=5, ncol=2, dimnames=list(NULL,c("Star","Votes")))
R> wtd.mean(review[, 1], weights = review[, 2])
[1] 4.0625
R> wtd.quantile(review[, 1], weights = review[, 2])
0% 25% 50% 75% 100%
1.00 3.75 5.00 5.00 5.00
I don't understand what the problem is. Why shouldn't you use a data.frame?
rv <- data.frame(star = ordered(review[, 1]), votes = review[, 2])
You should then convert your data.frame to a vector:
( vts <- with(rv, rep(star, votes)) )
[1] 5 5 5 5 5 5 5 5 5 5 4 4 3 2 1 1
Levels: 1 < 2 < 3 < 4 < 5
Then do the summary... I just don't know what kind of summary, since summary will bring you back to the start. O_o
summary(vts)
1 2 3 4 5
2 1 1 2 10
EDIT (on #Prasad's suggestion)
Since vts is an ordered factor, you should convert it to numeric, then calculate the summary (at this moment I will disregard the background statistical issues):
nvts <- as.numeric(levels(vts)[vts]) ## numeric conversion
summary(nvts) ## "ordinary" summary
fivenum(nvts) ## Tukey's five number summary
Just to clarify -- when you say you would like "mean, median, quartiles and min/max", you're talking in terms of number of stars? e.g mean = 4.062 stars?
Then using aL3xa's code, would something like summary(as.numeric(as.character(vts))) be what you want?
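Pulling the thread's suggestions together, a small sketch (using only the review matrix defined in the question): expand each star rating by its vote count with rep(), then summarise the resulting numeric vector; the mean matches the wtd.mean() result above.
review <- matrix(c(5:1, 10, 2, 1, 1, 2), nrow = 5, ncol = 2,
                 dimnames = list(NULL, c("Star", "Votes")))
scores <- rep(review[, "Star"], times = review[, "Votes"])  # one value per vote
summary(scores)   # min, quartiles, median, mean, max
fivenum(scores)   # Tukey's five-number summary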
