How to mutate row-wise and replace first 'n' values with 0 - r

I'm trying to build a measurement tracker form. I'd like to populate 0 into the "Measurement" columns for each row, based on the Quantity value for that row. (For Quantity = 2, first 2 measurements = 0, the rest of the row = NA). (For Quantity = 4, all measurements = 0).
I'm wondering how to mutate these rows and replace with the proper number of 0s in the correctly indexed positions, like so:
Feature Tool MIN MAX Quantity Measurement_1 Measurement_2 Measurement_3 Measurement_4
1 a m 0.5 1.0 2 0 0 NA NA
2 b n 0.4 1.2 4 0 0 0 0
The sample code to generate the dataframe is here:
#sample data
A1 <- data.frame(Feature = c("a","b"), Tool = c("m","n"), MIN = c(0.5,0.4), MAX = c(1.0,1.2), Quantity = c(2,4))
# Create empty data frame of NA
df <- data.frame(matrix(NA,
nrow = 1,
ncol = max(A1$Quantity)))
#create list of sequential measurements based on maximum quantity of measurements
M <- c(sprintf("Measurement_%01d", seq(1,max(A1$Quantity))))
#set column names to these measurements
colnames(df) <- M
#combine sample data with measurements
new_dat <- cbind(A1, df)
My first attempts to accomplish this are something like this:
new_dat %>% rowwise() %>% mutate(new_dat[,6:new_dat$Quantity] <- 0)
But it's clear I'm missing something here.
Thanks!

A vectorized approach in base R is to use row/column matrix indexing. Build a two-column index matrix, with each row index repeated Quantity times alongside the matching column indices, and then do the assignment:
j1 <- grep("Measurement", names(new_dat))
new_dat[cbind(rep(seq_len(nrow(new_dat)), new_dat$Quantity),
j1[sequence(new_dat$Quantity)])] <- 0
-output
> new_dat
Feature Tool MIN MAX Quantity Measurement_1 Measurement_2 Measurement_3 Measurement_4
1 a m 0.5 1.0 2 0 0 NA NA
2 b n 0.4 1.2 4 0 0 0 0
Or with dplyr, we could do
library(dplyr)
new_dat %>%
mutate(across(starts_with("Measurement"),
~ replace(.x, readr::parse_number(cur_column()) <= Quantity, 0)))
-output
Feature Tool MIN MAX Quantity Measurement_1 Measurement_2 Measurement_3 Measurement_4
1 a m 0.5 1.0 2 0 0 NA NA
2 b n 0.4 1.2 4 0 0 0 0
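As a hedged alternative sketch (not part of the answer above), the same fill can be written by comparing each measurement column's position with Quantity using base R's col(). It assumes the measurement columns are stored in order, so that the k-th matching column is Measurement_k. The names meas_cols, meas and fill are illustrative only.

```r
# Rebuild the question's data frame with empty measurement columns
new_dat <- data.frame(Feature = c("a", "b"), Tool = c("m", "n"),
                      MIN = c(0.5, 0.4), MAX = c(1.0, 1.2),
                      Quantity = c(2, 4),
                      Measurement_1 = NA, Measurement_2 = NA,
                      Measurement_3 = NA, Measurement_4 = NA)

# Positions of the measurement columns (assumed to be in order)
meas_cols <- grep("^Measurement_", names(new_dat))
meas <- as.matrix(new_dat[meas_cols])

# col(meas) holds each cell's column number; Quantity is recycled down
# the rows, so cell [i, j] is TRUE when j <= Quantity[i]
fill <- col(meas) <= new_dat$Quantity
meas[fill] <- 0
new_dat[meas_cols] <- meas
new_dat
```

Because the comparison is done on the whole matrix at once, no rowwise() step is needed.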

Related

How do I add a new column through a conditional mutate but preserve the original dataframe?

I have a large dataframe (df) containing 500+ rows, 50+ columns/variables but only want to target specific variables.
targ_vars <- c("all3a1", "3a1_arc",
"all3b1", "3b1_arc",
"all3c1", "3c1_arc")
The vector above contains the variables which have frequency data (i.e. multiple rows with 1,2,3 etc.)
I want to add a new count column in the original large dataframe (df) which contains the row sum of any non-NA value specifically for those select variables in "targ_vars".
Again, I'm not trying to add the value of the actual frequency data across each of those variables, but moreso just a sum of any non-NA value per row (i.e. 1,2,NA,7,NA,1 = total row count of 4 non-NA).
I've gotten as far as this:
df <- df %>%
select(targ_vars) %>%
mutate(targ_var_count = rowSums(!is.na(.), na.rm = TRUE))
The problem is I'm not sure how to "deselect" the variables I used to run the mutate calculation. The line above would result in overwriting the entire original dataframe (df) containing 50+ columns/vars, and placing back only the selected 6 variables in (targ_vars) plus the new (targ_var_count) variable that mutate calculated.
Essentially, I just want to focus on that last mutate line, and plop that new count column back into the original (df).
I tried something like the one below but it ended up giving me a list when I call "df$targcount" instead of just 1 rowSum column:
df$targcount <- df %>%
select(targ_vars) %>%
mutate(targcount = rowSums(!is.na(.), na.rm = TRUE))
Any help/tips would be appreciated.
You could use dplyr::across to get the count of non NA values for just your targ_vars columns.
Using some fake random example data:
set.seed(123)
dat <- data.frame(
a = sample(c(0, NA), 10, replace = TRUE),
b = sample(c(0, NA), 10, replace = TRUE),
c = sample(c(0, NA), 10, replace = TRUE),
d = sample(c(0, NA), 10, replace = TRUE)
)
targ_vars <- c("c", "d")
library(dplyr, warn.conflicts = FALSE)
dat %>%
mutate(targcount = rowSums(across(all_of(targ_vars), ~ !is.na(.x))))
#> a b c d targcount
#> 1 0 NA 0 0 2
#> 2 0 NA NA NA 0
#> 3 0 NA 0 0 2
#> 4 NA 0 0 NA 1
#> 5 0 NA 0 NA 1
#> 6 NA 0 0 0 2
#> 7 NA NA NA 0 1
#> 8 NA 0 NA 0 1
#> 9 0 0 0 0 2
#> 10 0 0 NA NA 0
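For comparison, here is a base R sketch of the same count that leaves every other column of the data frame in place. The small data frame is made up for illustration. Only targ_vars and the targcount column name come from the question.

```r
# Made-up example data; columns a and b stand in for the non-target variables
dat <- data.frame(
  a = c(1, NA, 3),
  b = c(NA, NA, 2),
  c = c(5, NA, NA),
  d = c(2, 7, NA)
)
targ_vars <- c("c", "d")

# !is.na() gives a logical matrix for the target columns only;
# rowSums() then counts the TRUE (non-NA) cells in each row
dat$targcount <- rowSums(!is.na(dat[targ_vars]))
dat
```

Assigning to a single new column this way avoids the select() step, so nothing in the original data frame is dropped.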

Calculate the difference between a row in a dataset and all rows in another dataset in R

I have 2 datasets, and for each row in dataset1 I want to calculate the difference from every row in dataset2. I also replace any negative difference with 0. Here is a simple example of my 2 datasets (the real ones are around 1000*1000).
df1 <- data.frame(ID = c(1, 2), Obs = c(1.0, 2.0), var=c(2.0,5.0))
df2 <- data.frame(ID = c(2, 1), Obs = c(3.0, 2.0),var=c(7.0,3.0))
df1
ID Obs var
1 1 1 2
2 2 2 5
df2
ID Obs var
1 2 3 7
2 1 2 3
for(i in 1:nrow(df1)){
b1=as.matrix(df1)
b2=as.matrix(df2)
diff= b1-b2
diff[which(diff < 0 )] <- 0
diff.data= data.frame(cbind(diff, total = rowSums(diff)))
}
diff.data
ID Obs var total
1 0 0 0 0
2 1 0 2 3
This is what i have been able to do, i did the difference between the 2 datasets, replace the negative values by 0 and also was interested to sum the values of the columns after. For the first row in df1 i would like to calculate the difference between all the rows in df2, and for the second row in df1 calculate the difference between all the rows in df2 (and so on). Note that i should not calculate the difference between the IDs (i don't know how to do it, maybe changing diff= b1-b2 by diff= b1[,-1]-b2[,-1]? ). I want to keep the ID from df1 to keep track of my patients (the case of my dataset). I would like to have something like that
diff.data
ID Obs var total
1 0 0 0
1 0 0 0
2 0 0 0
2 0 2 2
I thank you in advance for your help.
Here is what i have using your answer, i wanted to create a simple function. But i would like to have the option that my datasets could be either matrices or dataframes, i was only able to generate an error if the datasets are not dataframes:
difference=function(df1,df2){
if(class(df1) != "data.frame" || class(df2) != "data.frame") stop(" df1 or df2 is not a dataframe!")
df1=data.frame(df1)
df2=data.frame(df2)
ID1=seq(nrow(df1))
ID2=seq(nrow(df2))
new_df1 = df1[rep(ID1, each = nrow(df2)), ]
new_df1[-1] = new_df1[-1] - df2[rep(seq(nrow(df2)), nrow(df1)), -1]
new_df1[new_df1 < 0] = 0
new_df1$total = rowSums(new_df1[-1])
rownames(new_df1) = NULL
output=new_df1
return(output)
}
I know the fact that i specified df1=data.frame(df1) must be a dataframe its just i don't know how to also include that it could be a matrix.
Thank you again in advance for your help.
You can repeat each row of df1 nrow(df2) times and cycle through the rows of df2 nrow(df1) times, so that both data frames have the same number of rows and the values can be subtracted directly.
#Repeat each row of df1 nrow(df2) times (index by row position, not by ID value)
new_df1 <- df1[rep(seq_len(nrow(df1)), each = nrow(df2)), ]
#Repeat rows of df2 and subtract
new_df1[-1] <- new_df1[-1] - df2[rep(seq_len(nrow(df2)), nrow(df1)), -1]
#Replace negative values with 0
new_df1[new_df1 < 0] <- 0
#Add row-wise sum
new_df1$total <- rowSums(new_df1[-1])
#Remove rownames
rownames(new_df1) <- NULL
new_df1
# ID Obs var total
#1 1 0 0 0
#2 1 0 0 0
#3 2 0 0 0
#4 2 0 2 2
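On the follow-up about accepting either matrices or data frames, one possible sketch is to check with is.data.frame() and is.matrix() and then convert with as.data.frame(). The function name pairwise_diff is hypothetical, and the sketch assumes the first column is the ID and all remaining columns are numeric.

```r
# Hypothetical helper: clamped row-by-row differences between x and y
pairwise_diff <- function(x, y) {
  ok <- function(obj) is.data.frame(obj) || is.matrix(obj)
  if (!ok(x) || !ok(y)) stop("x and y must each be a data frame or a matrix")
  x <- as.data.frame(x)
  y <- as.data.frame(y)

  i <- rep(seq_len(nrow(x)), each = nrow(y))   # each row of x, nrow(y) times
  j <- rep(seq_len(nrow(y)), times = nrow(x))  # rows of y, cycled

  out <- x[i, , drop = FALSE]
  # Subtract everything except the ID column, then replace negatives with 0
  out[-1] <- pmax(as.matrix(x[i, -1, drop = FALSE]) -
                  as.matrix(y[j, -1, drop = FALSE]), 0)
  out$total <- rowSums(out[-1])
  rownames(out) <- NULL
  out
}

df1 <- data.frame(ID = c(1, 2), Obs = c(1.0, 2.0), var = c(2.0, 5.0))
df2 <- data.frame(ID = c(2, 1), Obs = c(3.0, 2.0), var = c(7.0, 3.0))
pairwise_diff(df1, df2)
pairwise_diff(as.matrix(df1), as.matrix(df2))
```

Testing with is.data.frame() and is.matrix() is also more robust than comparing class() to a string, because class() can return more than one value.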

Replace row values if sum is equal to zero in R

I want to replace the values of columns by NA if the sum of their rows is equal to 0. Imagine the following columns:
a b
0 0
1 5
2 8
3 7
0 0
5 8
I would like to replace these by:
a b
NA NA
1 5
2 8
3 7
NA NA
5 8
I've been looking for answers on many pages but have not found any solution.
Here is what I have tried so far:
df[ , 31:36][df[,31:36] == 0 ] <- NA #With df being my dataframe and 31:36 the columns I want to apply the replacement too.
This replaces all the values equal to 0 by NA
I've also tried other alternatives using rowSums() but have not found a solution.
Any help would be greatly appreciated.
Thanks
How about this?
a <- df[31:36,1]
b <- df[31:36,2]
c <- a
a[a+b==0] <- NA
b[c+b==0] <- NA
df[31:36,1] <- a
df[31:36,2] <- b
We have to create a temporary variable called c, otherwise when you are checking the second column, you will be adding NA+0 which equals NA not 0.
An idiomatic way of doing this using dplyr would be:
library(dplyr)
tb <- tibble(
a = c(0, 1:3, 0, 5),
b = c(0, 5, 8, 7, 0, 8)
)
tb <- tb %>%
# creates a "rowsum" column storing the sum of columns 1:2
mutate(rowsum = rowSums(.[1:2])) %>%
# applies, to columns 1:2, a function that puts NA when the sum of the rows is 0
mutate(across(1:2, ~ ifelse(rowsum == 0, NA, .x))) %>%
# removes rowsum
select(-rowsum)
Of course you could replace 1:2 with 31:36 when applying the code to your actual table.
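A base R sketch of the same idea, for readers who prefer to avoid the temporary column: compute the row sums once, then blank the matching rows across all target columns in a single assignment. It assumes the values are non-negative, so that a row sum of 0 means every value in the row is 0. The names cols and zero_rows are illustrative.

```r
df <- data.frame(a = c(0, 1, 2, 3, 0, 5),
                 b = c(0, 5, 8, 7, 0, 8))

cols <- c("a", "b")                   # with the real table this could be 31:36
zero_rows <- rowSums(df[cols]) == 0   # TRUE where the target columns sum to 0
df[zero_rows, cols] <- NA
df
```

Because the logical vector is computed before any value is overwritten, no temporary copy of a column is needed.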

R: Count of consecutive identical values across columns using RLE

I'm using R. I have a data frame that consists of a row for each player and then columns representing each month and a number of points they earned (illustrative data with random values below). I would like to add a new column (Points$ConsecutiveShutouts) that contains the longest consecutive streak for a specified point total over say the past 5 months.
Points <- data.frame("Player" = c("Alpha", "Beta", "Charlie", "Delta", "Echo", "Foxtrot", "Gamma"), "MayPts" = c(floor(runif(7, 0, 3))), "JunPts" = c(floor(runif(7, 0, 3))), "JulPts" = c(floor(runif(7, 0, 3))), "AugPts" = c(floor(runif(7, 0, 3))), "SepPts" = c(floor(runif(7, 0, 3))), "OctPts" = c(floor(runif(7, 0, 3))), "NovPts" = c(floor(runif(7, 0, 3))),"DecPts" = c(floor(runif(7, 0, 3))))
Player MayPts JunPts JulPts AugPts SepPts OctPts NovPts DecPts
Alpha 0 0 1 0 2 2 2 0
Beta 1 0 1 1 1 1 1 2
Charlie 1 2 2 0 2 1 1 0
Delta 0 1 1 2 2 2 0 0
Echo 1 1 0 2 1 2 0 1
Foxtrot 1 0 0 0 0 0 2 1
Gamma 2 0 1 1 0 2 0 1
I have tried using rle(points):
# Establish the start and end months
StartMonth <- which(colnames(Points) == "SepPts")
EndMonth <- which(colnames(Points) == "DecPts")
# Find total of consecutive months with 0 points
Points$ConsecutiveShutOuts <- max(rle(Points[ ,StartMonth:EndMonth] == 0), lengths[!values])
Doing this, I end up with the error "'X' must be a vector of an atomic type"
Any advice on what I am doing wrong and how I can fix? Or alternative approaches?
Thanks in advance! [Beginner here, so hopefully I followed the correct approach to question asking :)]
I would use long form as well. I would first create a function like this.
myfun <- function(series,value){
tmp <- rle(series); runs <- tmp$lengths[tmp$values == value]
if (length(runs)==0) return(0)
else return(max(runs))
}
Using tidyr/dplyr, you can proceed as
library(dplyr)
library(tidyr)
Points %>%
gather(months,Pts,MayPts:DecPts) %>%
group_by(Player) %>%
summarise(x=myfun(tail(Pts,5),0))
# Past 5 month, number of consecutive zeros for each player.
Of course, you can join the result to the original wide-form data frame if you'd like to.
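If staying in wide form is preferable, the same run-length helper can be applied to each row with apply(). This is only a sketch: longest_run is a renamed copy of the helper above, the three rows are copied from the question's illustrative table, and it assumes the month columns end in "Pts" and are in chronological order.

```r
# Longest run of `value` in a vector (0 if `value` never occurs)
longest_run <- function(series, value) {
  r <- rle(series)
  runs <- r$lengths[r$values == value]
  if (length(runs) == 0) 0L else max(runs)
}

Points <- data.frame(
  Player = c("Alpha", "Beta", "Foxtrot"),
  MayPts = c(0, 1, 1), JunPts = c(0, 0, 0), JulPts = c(1, 1, 0),
  AugPts = c(0, 1, 0), SepPts = c(2, 1, 0), OctPts = c(2, 1, 0),
  NovPts = c(2, 1, 2), DecPts = c(0, 2, 1)
)

# Positions of the last 5 month columns
month_cols <- tail(grep("Pts$", names(Points)), 5)

# apply() passes each row of the selected columns to longest_run as a vector
Points$ConsecutiveShutouts <- apply(Points[month_cols], 1, longest_run, value = 0)
Points[c("Player", "ConsecutiveShutouts")]
```

Passing one row at a time as a plain vector is also what avoids the "must be a vector of an atomic type" error, which comes from handing rle() a whole data frame or matrix.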
If you want to sum based upon some condition (e.g., only summing points higher than 1), you can melt and restrict the summation only to rows greater than that value.
library(data.table)
Points <- as.data.table(Points)
Points <- melt(Points, id="Player", variable.name = "Month", value.name = "PTs")
Points <- Points[PTs>1, list(PTs = sum(PTs, na.rm=TRUE)), by="Player"] #change ">1" if you prefer a different value

Replace value per row with value in first column

My question is very simple. I have a data frame with various numbers in each row, more than 100 columns. First column is always a non zero number. What I want to do is replace each nonzero number in each row (excluding the first column) with the first number in the row (the value of the first column)
I would think in the lines of an ifelse and a for loop that iterates through rows but there must be a simpler vectorised way to do it...
Another approach is to use sapply, which is more efficient than looping. Assuming your data is in a data frame df:
df[,-1] <- sapply(df[,-1], function(x) {ind <- which(x!=0); x[ind] = df[ind,1]; return(x)})
Here, we are applying the function over each and all columns of df except for the first column. In the function, x is each of these columns in turn:
First, find the row indices of the column that are nonzero using which.
Set these rows in x to the corresponding values in the first column of df.
Return the column.
Note that the operations in the function are all "vectorized" over the column. That is, no looping over the rows of the column. The result from sapply is a matrix of the processed columns, which replaces all columns of df that are not the first column.
See this for an excellent review of the *apply family of functions.
Hope this helps.
Since your data is not that big, I suggest you use a simple loop
for (i in 1:nrow(mydata))
{
for (j in 2:ncol(mydata))
{
mydata[i,j]<- ifelse(mydata[i,j]==0 ,0 ,mydata[i,1])
}
}
Suppose your data frame is dat, I have a fully-vectorized solution for you:
mat <- as.matrix(dat[, -1])
pos <- which(mat != 0)
mat[pos] <- rep(dat[[1]], times = ncol(mat))[pos]
new_dat <- "colnames<-"(cbind.data.frame(dat[1], mat), colnames(dat))
Example
set.seed(0)
dat <- "colnames<-"(cbind.data.frame(1:5, matrix(sample(0:1, 25, TRUE), 5)),
c("val", letters[1:5]))
# val a b c d e
#1 1 1 0 0 1 1
#2 2 0 1 0 0 1
#3 3 0 1 0 1 0
#4 4 1 1 1 1 1
#5 5 1 1 0 0 0
My code above gives:
# val a b c d e
#1 1 1 0 0 1 1
#2 2 0 2 0 0 2
#3 3 0 3 0 3 0
#4 4 4 4 4 4 4
#5 5 5 5 0 0 0
You want a benchmark?
set.seed(0)
n <- 2000 ## use a 2000 * 2000 matrix
dat <- "colnames<-"(cbind.data.frame(1:n, matrix(sample(0:1, n * n, TRUE), n)),
c("val", paste0("x",1:n)))
## have to test my solution first, as aichao's solution overwrites `dat`
## my solution
system.time({mat <- as.matrix(dat[, -1])
pos <- which(mat != 0)
mat[pos] <- rep(dat[[1]], times = ncol(mat))[pos]
"colnames<-"(cbind.data.frame(dat[1], mat), colnames(dat))})
# user system elapsed
# 0.352 0.056 0.410
## solution by aichao
system.time(dat[,-1] <- sapply(dat[,-1], function(x) {ind <- which(x!=0); x[ind] = dat[ind,1]; x}))
# user system elapsed
# 7.804 0.108 7.919
My solution is about 19 times faster (0.41 s versus 7.92 s elapsed)!
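One more vectorized variant, offered as a sketch rather than a benchmarked alternative: multiplying the logical "is nonzero" matrix by the first column gives the first-column value where a cell is nonzero and 0 elsewhere. It assumes the columns are numeric and contain no NA values. The small data frame is made up for illustration.

```r
dat <- data.frame(val = 1:3,
                  a = c(1, 0, 7),
                  b = c(0, 2, 0))

# (dat[-1] != 0) is a logical matrix; dat[[1]] is recycled down each column,
# so cell [i, j] becomes val[i] when nonzero and 0 otherwise
dat[-1] <- (dat[-1] != 0) * dat[[1]]
dat
```

This relies on R recycling a vector down the rows of a matrix, which works here because the vector's length equals the number of rows.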
