Proportion across rows and specific columns in R
I want to compute a proportion of PCR detections and store it in a new column, using the following test dataset. I want the proportion of detections for each row, using only the columns pcr1 to pcr6; the other columns should be ignored.
site sample pcr1 pcr2 pcr3 pcr4 pcr5 pcr6
pond 1 1 1 1 1 0 1 1
pond 1 2 1 1 0 0 1 1
pond 1 3 0 0 1 1 1 1
The output should be a new column containing the detection proportion. The dataset above is only a small sample of the one I am using. I've tried:
data$detection.proportion <- rowMeans(subset(testdf, select = c(pcr1, pcr2, pcr3, pcr4, pcr5, pcr6)), na.rm = TRUE)
This works for this small dataset, but when I tried it on my larger one it gave incorrect proportions. What I'm looking for is a way to count all the 1s from pcr1 to pcr6 and divide them by the total number of 1s and 0s (which I know is 6, but I would like R to work this out in case the number of columns changes).
I found a way to do it, in case anyone else needs it. I don't know if this is the most efficient approach, but it worked for me.
# length() of a data frame is its number of columns, so this stores
# the number of pcr columns (6 here)
data$detection.proportion <- length(subset(data, select = c(pcr1, pcr2, pcr3, pcr4, pcr5, pcr6)))

# Counts the occurrences of 1 (i.e. a detection) in each row.
# The dropped indices -1, -2, -3, -10 refer to the non-pcr columns
# of my larger dataset; adjust them to your own data.
p.detection <- rowSums(data[, c(-1, -2, -3, -10)] == 1)

# Divides the detections by the total number of pcr columns
# to get the detection proportion
data$detection.proportion <- p.detection / data$detection.proportion
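A sketch of a more defensive version of the same calculation: select the pcr columns by name pattern so the count adapts automatically if columns are added or removed, and let rowMeans do the count-and-divide in one step. The `"^pcr"` pattern is an assumption about the column naming.

```r
testdf <- data.frame(site = c("pond 1", "pond 1", "pond 1"),
                     sample = 1:3,
                     pcr1 = c(1, 1, 0), pcr2 = c(1, 1, 0), pcr3 = c(1, 0, 1),
                     pcr4 = c(0, 0, 1), pcr5 = c(1, 1, 1), pcr6 = c(1, 1, 1))

# Pick out the pcr columns by name so non-pcr columns are ignored automatically
pcr.cols <- grep("^pcr", names(testdf))

# rowMeans of 0/1 columns is exactly "number of 1s divided by number of columns"
testdf$detection.proportion <- rowMeans(testdf[, pcr.cols], na.rm = TRUE)
```

Because the columns are found by pattern rather than listed explicitly, this keeps working if a pcr7 column is added later.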
Related
Procedural way to generate signal combinations and their output in r
I have been continuing to learn R to transition away from Excel, and I am wondering what the best way to approach the following problem is, or at least what tools are available to me. I have a large data set (100K+ rows) and several columns that I could generate a signal from, and each value in the vectors can range between 0 and 3.

sig1 sig2 sig3 sig4
   1    1    1    1
   1    1    1    1
   1    0    1    1
   1    0    1    1
   0    0    1    1
   0    1    2    2
   0    1    2    2
   0    1    1    2
   0    1    1    2

I want to generate composite signals using the state of each cell in the four columns, then see what each of the composite signals tells me about the returns in a time series. For this question the scope is only generating the combinations. So, for example, one composite signal would be when all four cells in the vectors equal 0. I could generate a new column that reads TRUE in that case and FALSE in every other case, then go on to figure out how that affects the returns from the rest of the data frame. The thing is, I want to check all combinations of the four columns (0000, 0001, 0002, 0003, and so on), which is quite a few. With the extent of my knowledge of R, I only know how to do that by using mutate() for each combination and explicitly entering the condition to check. I assume there is a better way to do this, but I haven't found it yet. Thanks for the help!
I think that you could paste the columns together to get unique combinations, then just turn this into dummy variables:

library(dplyr)
library(dummies)

# Create sample data
data <- data.frame(sig1 = c(1, 1, 1, 1, 0, 0, 0),
                   sig2 = c(1, 1, 0, 0, 0, 1, 1),
                   sig3 = c(2, 2, 0, 1, 1, 2, 1))

# Paste together
data <- data %>% mutate(sig_tot = paste0(sig1, sig2, sig3))

# Generate dummies
data <- cbind(data, dummy(data$sig_tot, sep = "_"))

# Turn to logical if needed
data <- data %>% mutate_at(vars(contains("data_")), as.logical)
data
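If the dummies package is unavailable (it may not be on current CRAN), the same idea can be sketched in base R with model.matrix, which builds one indicator column per combination. The `sig_` column-name prefix here is my own choice, not from the answer above.

```r
data <- data.frame(sig1 = c(1, 1, 1, 1, 0, 0, 0),
                   sig2 = c(1, 1, 0, 0, 0, 1, 1),
                   sig3 = c(2, 2, 0, 1, 1, 2, 1))

# Paste the columns into one combination key per row
data$sig_tot <- paste0(data$sig1, data$sig2, data$sig3)

# ~ 0 + factor(...) yields one 0/1 indicator column per observed combination;
# comparing to 1 turns the numeric matrix into logicals
dummies <- model.matrix(~ 0 + factor(sig_tot), data) == 1
colnames(dummies) <- paste0("sig_", levels(factor(data$sig_tot)))
data <- cbind(data, dummies)
```

Only combinations that actually occur in the data get a column, which keeps the result manageable even though the full combinatorial space (4^4 for four 0-3 columns) is large.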
Complex data calculation for consecutive zeros at row level in R (lag vs lead)
I have a complex calculation that needs to be done. It is basically at row level, and I am not sure how to tackle it. If you can help me with the approach or any relevant functions, that would be really great. I will break my problem into two sub-problems for simplicity. Below is how my data looks:

Group,Date,Month,Sales,lag7,lag6,lag5,lag4,lag3,lag2,lag1,lag0(reference),lead1,lead2,lead3,lead4,lead5,lead6,lead7
Group1,42005,1,2503,1,1,0,0,0,0,0,0,0,0,0,0,1,0,1
Group1,42036,2,3734,1,1,1,1,1,0,0,0,0,1,1,0,0,0,0
Group1,42064,3,6631,1,0,0,1,0,0,0,0,0,0,1,1,1,1,0
Group1,42095,4,8606,0,1,0,1,1,0,1,0,1,1,1,0,0,0,0
Group1,42125,5,1889,0,1,1,0,1,0,0,0,0,0,0,0,1,1,0
Group1,42156,6,4819,0,1,0,0,0,1,0,0,1,0,1,1,1,1,0
Group1,42186,7,5120,0,0,1,1,1,1,1,0,0,1,1,0,1,1,0

I have data for each Group at monthly level. I would like to capture the two things below.

1. The count of consecutive zeros for each row, extending in both directions from lag0(reference). The cases of interest are the zeros that run consecutively from lag0(reference) up to the point where the first 1 is reached. I want to capture the count of zeros at row level, along with the corresponding Sales value. Below is the output I am looking for in part 1.

Output:
Month,Sales,Count
1,2503,9
2,3734,3
3,6631,5
4,8606,0
5,1889,6
6,4819,1
7,5120,1

2. Identify the consecutive rows (rows 1, 2 and 3, and similarly rows 5 and 6) where any lag or lead overlaps for any 0 within the lag0(reference) range, and capture their Sales and Month values. For example, for rows 1, 2 and 3 the overlap happens at least at lag3, lag2, lag1 and lead1, lead2; this needs to be captured and tagged as case 1. Similarly, for rows 5 and 6 at least lag1 overlaps, so this needs to be captured and tagged as case 2, along with the Sales and Month values. Row 7 does not overlap with the previous or following consecutive row, so it is not captured. Below is the result I am looking for in part 2.

Month,Sales,Case
1,2503,1
2,3734,1
3,6631,1
5,1889,2
6,4819,2

I want to run this for multiple groups, so I will incorporate either dplyr or a loop to get the result. Currently I am simply looking for the approach; this is the first time I am trying to capture things at row level in R. I am not looking for a full solution, simply a first step to attack this problem. I would appreciate any leads.
An option using rle for the first part of the calculation:

df$count <- apply(df[, -c(1:4)], 1, function(x){
  first <- rle(x[1:7])    # runs of equal values within lag7..lag1
  second <- rle(x[9:15])  # runs of equal values within lead1..lead7
  count <- 0
  if(first$values[length(first$values)] == 0){
    count = first$lengths[length(first$values)]  # trailing zeros on the lag side
  }
  if(second$values[1] == 0){
    count = count + second$lengths[1]            # leading zeros on the lead side
  }
  count
})

df[, c("Month", "Sales", "count")]
#   Month Sales count
# 1     1  2503     9
# 2     2  3734     3
# 3     3  6631     5
# 4     4  8606     0
# 5     5  1889     6
# 6     6  4819     1
# 7     7  5120     1

Data:

df <- read.table(text = "Group,Date,Month,Sales,lag7,lag6,lag5,lag4,lag3,lag2,lag1,lag0(reference),lead1,lead2,lead3,lead4,lead5,lead6,lead7
Group1,42005,1,2503,1,1,0,0,0,0,0,0,0,0,0,0,1,0,1
Group1,42036,2,3734,1,1,1,1,1,0,0,0,0,1,1,0,0,0,0
Group1,42064,3,6631,1,0,0,1,0,0,0,0,0,0,1,1,1,1,0
Group1,42095,4,8606,0,1,0,1,1,0,1,0,1,1,1,0,0,0,0
Group1,42125,5,1889,0,1,1,0,1,0,0,0,0,0,0,0,1,1,0
Group1,42156,6,4819,0,1,0,0,0,1,0,0,1,0,1,1,1,1,0
Group1,42186,7,5120,0,0,1,1,1,1,1,0,0,1,1,0,1,1,0",
header = TRUE, stringsAsFactors = FALSE, sep = ",")
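For readers unfamiliar with rle: it compresses a vector into runs of equal values, which is what makes the trailing-zero and leading-zero counts above easy to read off. A minimal illustration on the lag side of row 1:

```r
# lag7..lag1 of row 1: two 1s followed by five 0s
r <- rle(c(1, 1, 0, 0, 0, 0, 0))

r$values   # the distinct run values: 1 then 0
r$lengths  # the run lengths: 2 and 5 (so 5 trailing zeros next to lag0)
```

The answer's function just checks whether the last run on the lag side (and the first run on the lead side) is a run of zeros, and if so adds its length.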
How to determine the uniqueness of each column values in its own dynamic range?
Assuming my dataframe has one column, I wish to add another column indicating whether the ith element is unique within the first i elements. The result I want is:

c1 c2
 1  1
 2  1
 3  1
 2  0
 1  0

For example, 1 is unique in {1}, 2 is unique in {1,2}, 3 is unique in {1,2,3}, 2 is not unique in {1,2,3,2}, and 1 is not unique in {1,2,3,2,1}. Here is my code, but it runs extremely slowly given I have nearly 1 million rows:

for(i in 1:nrow(df)){
  k <- sum(df$C1[1:i] == df$C1[i])
  if(k > 1){df[i, "C2"] = 0}
  else{df[i, "C2"] = 1}
}

Is there a quicker way of achieving this?
The following works:

x$c2 = as.numeric(!duplicated(x$c1))

Or, if you prefer more explicit code (I do, but it's slower in this case):

x$c2 = ifelse(duplicated(x$c1), 0, 1)
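A quick check on the example from the question: duplicated() marks each element that has already appeared earlier in the vector, which is exactly the "not unique within the first i elements" condition, and it runs in vectorized time rather than the loop's quadratic time.

```r
x <- data.frame(c1 = c(1, 2, 3, 2, 1))

# duplicated() is TRUE for each value seen earlier; negate and convert to 0/1
x$c2 <- as.numeric(!duplicated(x$c1))

x$c2  # 1 1 1 0 0
```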
Mutate Cumsum with Previous Row Value
I am trying to run a cumsum on a data frame on two separate columns. They are essentially tabulations of events for two different variables; only one variable can have an event recorded per row. The way I attacked the problem was to create a new variable holding the value 1, and create two new columns to sum the variables' totals. This works fine, and I get the correct total number of occurrences, but the problem is that in my current ifelse statement, if the event recorded is for variable "A", then variable "B" is assigned 0. Instead, for every row I want the other variable's previous value carried forward, so that I don't end up with gaps where it goes from 1 to 2, to 0, to 3. I don't want to run summarize on this either; I would prefer to keep each recorded instance and build the new columns with mutate.

CURRENT DF:
Event Value Variable Total.A Total.B
    1     1        A       1       0
    2     1        A       2       0
    3     1        B       0       1
    4     1        A       3       0

DESIRED RESULT:
Event Value Variable Total.A Total.B
    1     1        A       1       0
    2     1        A       2       0
    3     1        B       2       1
    4     1        A       3       1

Thanks!
You can use the property of booleans that they can be summed as ones and zeroes. Therefore, you can use the cumsum function:

DF$Total.A <- cumsum(DF$Variable == "A")

Or, as a more general approach provided by @Frank:

uv <- unique(as.character(DF$Variable))
DF[, paste0("Total.", uv)] <- lapply(uv, function(x) cumsum(DF$Variable == x))
If you have many levels to your factor, you can get this in one line by dummy coding and then cumsumming the matrix:

X <- model.matrix(~Variable + 0, DF)
apply(X, 2, cumsum)
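The general lapply approach can be verified against the question's data; the data-frame construction below is mine, built from the CURRENT DF shown above:

```r
DF <- data.frame(Event = 1:4, Value = 1, Variable = c("A", "A", "B", "A"))

# One running total per distinct variable; cumsum of a logical vector
# counts TRUEs, so each column counts its own variable's events and
# simply carries the total forward on other rows
uv <- unique(as.character(DF$Variable))
DF[, paste0("Total.", uv)] <- lapply(uv, function(x) cumsum(DF$Variable == x))

DF$Total.A  # 1 2 2 3
DF$Total.B  # 0 0 1 1
```

Note how row 3 keeps Total.A at 2 rather than resetting to 0, which is exactly the gap the question wanted to avoid.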
R data frame, sampling with replacement while controlling for two variables
I have the following data frame in R, with three variables:

id <- c(1,2,3,4,5,6,7,8,9,10)
frequency <- c(1,2,3,4,5,6,7,8,9,10)
male <- c(1,0,1,0,1,0,1,0,1,0)
df <- data.frame(id, frequency, male)

For df, the mean frequency is 5.5 and 50% of observations are male. Now I want to take a random sample with replacement from df, of the same size, such that the mean frequency of the new sample is 4 while the proportion of males remains constant. I wonder if there is any way to do such a thing in R. Thanks in advance.
I cannot find a ready-made function for what you want, but the following gives results of that kind. The combination of repeat and an if condition plays the same role as a while loop; the sampling line draws a sample of size 4:

repeat {
  df.sample = df[sample(nrow(df), size = 4, replace = FALSE), ]
  if(mean(df.sample$frequency) == 4.5 & mean(df.sample$male) == 0.5){
    break
  }
}

The result is:

> df.sample
  id frequency male
4  4         4    0
2  2         2    0
9  9         9    1
3  3         3    1

With a while loop (note that df.sample must exist before the condition is first evaluated):

while(!(mean(df.sample$frequency) == 4.5 & mean(df.sample$male) == 0.5)){
  df.sample = df[sample(nrow(df), size = 4, replace = FALSE), ]
}
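The question actually asks for a sample of the same size (10) drawn with replacement; a sketch of the same rejection idea adapted to those constraints follows. One wrinkle: with exactly five males the frequency sum is always odd (five odd values plus five even values), so a mean of exactly 4 (sum 40) is unattainable with this data; the 0.1 tolerance on the mean is therefore my assumption, not part of the question.

```r
set.seed(42)  # for reproducibility

id <- 1:10
frequency <- 1:10
male <- rep(c(1, 0), 5)
df <- data.frame(id, frequency, male)

n <- nrow(df)
repeat {
  # same size as df, drawn with replacement
  s <- df[sample(n, size = n, replace = TRUE), ]
  # keep the male proportion exact; allow a small tolerance on the mean
  if (mean(s$male) == 0.5 && abs(mean(s$frequency) - 4) <= 0.1) break
}

mean(s$male)       # 0.5 by construction
mean(s$frequency)  # within 0.1 of 4
```

Rejection sampling like this is fine for a small target set, but the loop can run long if the constraints are tight; widening the tolerance trades accuracy for speed.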