How do I mutate values differently in separate columns in R

I am fairly new to R and I feel like this should be pretty straightforward, but I keep getting errors. I have a data table called Master2 that has concentration data for several analytes, taken at different stations. My analytes are my column names. Several ND, or non-detect, values exist in each column. I would like to change NDs in my TKN column to 0.05 and NDs in all other columns to 0.005.
I was able to easily change all values in the frame to 0.005 with this code:
Master2 <- Master2 %>%
mutate_if(is.character, ~ if_else(. == "ND", "0.005", .))
I have tried a variety of approaches (replace, mutate_at, ...) to change the NDs in the TKN column separately before running this line of code, with no success. Below is a mock-up of my data; any help is greatly appreciated!
Master2 <- data.frame(Station = c("C3A","C3A","C3A","MD10","MD10","MD10","C10A"),
Date = c("1/15/2009","1/16/2009","1/17/2009","1/18/2009","1/19/2009","1/20/2009","1/21/2009"),
DissAmmonia = c("0.3","0.25","0.18","ND","1.2","0.5","0.8"),
DissNitrateNitrite = c("0.6","ND","0.15","0.2","0.4","0.6","ND"),
TotPhos = c("0.1","0.3","ND","0.4","0.2","0.12","0.1"),
TKN = c("ND","0.2","0.13","0.5","ND","0.8","1.2"),
stringsAsFactors = FALSE)

You can do:
#Change 'ND' values in TKN to 0.05
Master2$TKN[Master2$TKN == 'ND'] <- 0.05
#Change 'ND' values in all other columns to 0.005
Master2[Master2 == 'ND'] <- 0.005
#Change the classes of data to respective types.
Master2 <- type.convert(Master2, as.is = TRUE)
Master2
# Station Date DissAmmonia DissNitrateNitrite TotPhos TKN
#1 C3A 1/15/2009 0.300 0.600 0.100 0.05
#2 C3A 1/16/2009 0.250 0.005 0.300 0.20
#3 C3A 1/17/2009 0.180 0.150 0.005 0.13
#4 MD10 1/18/2009 0.005 0.200 0.400 0.50
#5 MD10 1/19/2009 1.200 0.400 0.200 0.05
#6 MD10 1/20/2009 0.500 0.600 0.120 0.80
#7 C10A 1/21/2009 0.800 0.005 0.100 1.20

If you want to stick with dplyr and still use mutate, try:
Master2 <- Master2 %>%
mutate(TKN = case_when(TKN == "ND" ~ "0.05", TRUE ~ TKN)) %>%
mutate(across(-TKN, ~ case_when(.x == "ND" ~ "0.005", TRUE ~ .x)))
I haven't been able to test this code against your data, so please verify that it gives your desired output.

A combination of my code and @Ronak Shah's solution appears to work!
#Change 'ND' values in TKN to 0.05
Master2$TKN[Master2$TKN == 'ND'] <- 0.05
#Change 'ND' values in all other columns to 0.005
Master2 <- Master2 %>%
mutate_if(is.character, ~ if_else(. == "ND", "0.005", .))
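For what it's worth, the two replacements can also be written as one dplyr pipeline. This is just a sketch of that idea, assuming the column names from the mock-up above and dplyr >= 1.0 (for across() and where()); it keeps everything as character and then converts types at the end, as in the first answer:
library(dplyr)

Master2 <- Master2 %>%
  mutate(TKN = if_else(TKN == "ND", "0.05", TKN)) %>%                          # NDs in TKN become 0.05
  mutate(across(where(is.character), ~ if_else(.x == "ND", "0.005", .x))) %>%  # NDs everywhere else become 0.005
  type.convert(as.is = TRUE)                                                   # turn numeric-looking columns numeric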

Related

using lag for creating an x+1 column

I'm trying to implement a lag function, but it seems I need an existing x column for it to work.
Let's say I have this data frame:
df <- data.frame(AgeGroup=c("0-4", "5-39", "40-59","60-69","70+"),
px=c(.99, .97, .95, .96,.94))
I want a column Ix that is lag(Ix)*lag(px), starting from 1000.
The data I want is:
df2 <- data.frame(AgeGroup=c("0-4", "5-39", "40-59","60-69","70+"),
px=c(.99, .97, .95, .96, .94),
Ix=c(1000, 990, 960.3, 912.285, 875.7936))
I've tried
library(dplyr)
df2<-mutate(df,Ix = lag(Ix, default = 1000)*lag(px))
ifelse statements don't work after creating a reference value first, either:
df$Ix2=NA
df[1,3]=1000
df$Ix<-ifelse(df[,3]==1000,1000,
lag(df$Ix, default = 1000)*lag(px,default =1))
and I have been playing around with creating a separate Ix column with Ix = 1000 and then running the above, but it doesn't seem to work. Does anyone have any ideas how I can create the x+1 column?
You could use cumprod() combined with dplyr::lag() for this:
> df$Ix <- 1000*dplyr::lag(cumprod(df$px), default = 1)
> df
AgeGroup px Ix
1 0-4 0.99 1000.0000
2 5-39 0.97 990.0000
3 40-59 0.95 960.3000
4 60-69 0.96 912.2850
5 70+ 0.94 875.7936
You can also use accumulate from purrr. Using head(px, -1) includes all values in px except the last one, and the initial Ix is set to 1000.
library(tidyverse)
df %>%
mutate(Ix = accumulate(head(px, -1), prod, .init = 1000))
Output
AgeGroup px Ix
1 0-4 0.99 1000.0000
2 5-39 0.97 990.0000
3 40-59 0.95 960.3000
4 60-69 0.96 912.2850
5 70+ 0.94 875.7936
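If you prefer to stay in base R, the same cumulative product can be written with Reduce(); just a sketch of the equivalent, using the df defined above:
# Accumulate the running product, seeded with 1000; head(df$px, -1) drops the
# last px so the result has one value per row.
df$Ix <- Reduce(`*`, head(df$px, -1), init = 1000, accumulate = TRUE)
This reproduces the Ix column shown in df2 above.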

Conditional change of a column in a data frame

Apologies in advance if this has already been asked elsewhere, but I've tried different attempts and nothing has worked so far.
I have a data frame Data containing measurements of air pollution. The columns "Measuring.Unit" and "Uncertainty.Unit" show that most of the measurements are expressed in "mol/L" but some of them are expressed in "mol/mL".
head(Data)
Locality.Name Chemical Concentration Measuring.Unit Uncertainty Uncertainty.Unit
1 xxxx NH3 0.065 mol/L 0.010 mol/L
2 xxxx CO 0.015 mol/L 0.004 mol/L
3 xxxx CO2 0.056 mol/L 0.006 mol/L
4 xxxx O3 0.67 mol/mL 0.010 mol/mL
5 xxxx H2SO4 0.007 mol/L 0.0008 mol/L
6 xxxx NO 0.89 mol/mL 0.08 mol/mL
Before starting any analysis, I want to change each value expressed in mol/mL to mol/L using a simple function and, of course, change the associated character "mol/mL" to "mol/L". This should be something like this (but I guess there are much simpler ways using dplyr or tidyverse):
# First step
if (Data$Measuring.Unit == "mol/mL") {Data$Concentration <- Data$Concentration * 1000 }
else {Data$Concentration <- Data$Concentration }
if (Data$Uncertainty.Unit == "mol/mL") {Data$Uncertainty <- Data$Uncertainty * 1000 }
else {Data$Uncertainty <- Data$Uncertainty}
# Second step
Data$Measuring.Unit[Data$Measuring.Unit == 'mol/mL'] <- 'mol/L'
Data$Uncertainty.Unit[Data$Uncertainty.Unit == 'mol/mL'] <- 'mol/L'
You can try:
Data$Concentration <- ifelse(Data$Measuring.Unit == "mol/mL",Data$Concentration * 1000,Data$Concentration)
Data$Uncertainty <- ifelse(Data$Uncertainty.Unit == "mol/mL",Data$Uncertainty * 1000,Data$Uncertainty)
This step looks fine:
Data$Measuring.Unit[Data$Measuring.Unit == 'mol/mL'] <- 'mol/L'
Data$Uncertainty.Unit[Data$Uncertainty.Unit == 'mol/mL'] <- 'mol/L'
if() expects a single logical value, while ifelse() is vectorized and works element-wise on whole columns.
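Since the question mentions dplyr/tidyverse, here is a sketch of the same logic in a single mutate() call (assuming the column names shown in head(Data)); the unit columns are recoded after the values so the tests still see the original units:
library(dplyr)

Data <- Data %>%
  mutate(
    Concentration    = ifelse(Measuring.Unit == "mol/mL", Concentration * 1000, Concentration),
    Uncertainty      = ifelse(Uncertainty.Unit == "mol/mL", Uncertainty * 1000, Uncertainty),
    Measuring.Unit   = ifelse(Measuring.Unit == "mol/mL", "mol/L", Measuring.Unit),
    Uncertainty.Unit = ifelse(Uncertainty.Unit == "mol/mL", "mol/L", Uncertainty.Unit)
  )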

Creating a ranking column based on values in two other columns using mutate and min_rank

I'm attempting to revisit some older code in which I used a for loop to calculate a combined ranking of genes based on two columns. My end goal is to get out a column that lists the proportion of genes that any given gene in the dataset performs better than.
I have a data.frame that I'm calling scores which contains two columns of relevant scores for my genes. To calculate the combined ranking I use the following for loop and I calculate the proportional score by dividing the resulting rank by the total number of observations.
scores <- data.frame(x = c(0.128, 0.279, 0.501, 0.755, 0.613), y = c(1.49, 1.43, 0.744, 0.647, 0.380))
#Calculate ranking
comb.score = matrix(0, nrow = nrow(scores), ncol = 1)
for(i in 1:nrow(scores)){
comb.score[i] = length(which(scores[ , 1] < scores[i, 1] & scores[ , 2] < scores[i, 2]))
}
comb.score <- comb.score/length(comb.score) #Calculate proportion
Now that I've become more familiar and comfortable with the tidyverse I want to convert this code to use tidyverse functions but I haven't been able to figure it out on my own, nor with SO or RStudio community answers.
The idea I had in mind was to use mutate() along with min_rank() but I'm not entirely sure of the syntax. Additionally the behavior of min_rank() appears to assess rank using a logical test like scores[ , 1] <= scores[i, 1] as opposed to just using < like I did in my original test.
My expected out come is an additional column in the scores table that has the same output as the comb.score output in the above code: a score that tells me the proportion of genes in the whole dataset that a gene on a given row performs better than.
Any help would be much appreciated! If I need to clarify anything or add more information please let me know!
Interesting question. I propose this approach:
scores %>%
rowwise() %>%
mutate(comb_score = sum(x > .$x & y > .$y)) %>%
ungroup() %>%
mutate(comb_score = comb_score/n())
which gives
# A tibble: 5 x 3
x y comb_score
<dbl> <dbl> <dbl>
1 0.128 1.49 0
2 0.279 1.43 0
3 0.501 0.744 0
4 0.755 0.647 0.2
5 0.613 0.38 0
A bit similar to Martin's answer, but using pmap instead.
library(tidyverse)
scores <- data.frame(
x = c(0.128, 0.279, 0.501, 0.755, 0.613),
y = c(1.49, 1.43, 0.744, 0.647, 0.380)
)
scores %>%
mutate(
score = pmap_dbl(list(x, y), ~ sum(..1 > x & ..2 > y)) / n()
)
#> x y score
#> 1 0.128 1.490 0
#> 2 0.279 1.430 0
#> 3 0.501 0.744 0
#> 4 0.755 0.647 0.2
#> 5 0.613 0.380 0
Created on 2020-06-18 by the reprex package (v0.3.0)
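For comparison, the same proportion can also be computed without rowwise() or pmap(), e.g. with base mapply(); this sketch mirrors the strict < comparison from the original loop:
# For each row, count the rows that are strictly smaller in both columns,
# then divide by the number of rows to get a proportion.
scores$comb_score <- mapply(function(xi, yi) sum(scores$x < xi & scores$y < yi),
                            scores$x, scores$y) / nrow(scores)
This reproduces the 0, 0, 0, 0.2, 0 shown above.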

Taking the mean of 10000 replications of random sampling for each row

I did a replication 10000 times where I took a random sample from a list of IDs and then paired them with another list of IDs. After that I added a column that gives the relatedness of each pair. Then I took the mean of the relatedness for each set of random sampling, so I end up with 10000 values that represent the mean relatedness for each set of random sampling. However, I want to instead take the mean of the relatedness of each row across all 10000 sets of random sampling.
An example of what I want:
Lets say I have 10000 sets of 3 random pairings.
Set 1
female_ID male_ID relatedness
0 12-34 23-65 0.034
1 44-62 56-24 0.56
2 76-11 34-22 0.044
Set 2
female_ID male_ID relatedness
0 98-54 53-12 0.022
1 22-43 13-99 0.065
2 09-22 65-22 0.12
etc...
I want the mean of the rows for relatedness of each set, so I want a list of 3 values: 0.028 (mean of 0.034 and 0.022), 0.3125 (mean of 0.56 and 0.065), 0.082 (mean of 0.044 and 0.12), except it would be the mean across 10000 sets, and not just 2.
Here's my code so far:
mean_rel <- replicate(10000, {
random_mal <- sample(list_of_males, 78, replace=TRUE)
random_pair <- cbind(list_of_females, random_mal)
random_pair <- data.frame(random_pair)
random_pair$pair <- with(random_pair, paste(list_of_females, random_mal, sep = " "))
typeA <- genome$rel[match(random_pair$pair, genome_year$pair1)]
typeB <- genome$rel[match(random_pair$pair, genome_year$pair2)]
random_pair$relatedness <- ifelse(is.na(typeA), typeB, typeA)
random_pair <- na.omit(random_pair)
mean_random_pair_relatedness <- mean(random_pair$relatedness)
mean_random_pair_relatedness
})
If you add simplify = FALSE to your replicate(), between the closing } and the ), then mean_rel will be output as a list. For the per-row means you want, the block also needs to return the random_pair data frame itself rather than its overall mean, so that each replicate keeps its rows:
mean_rel <- replicate(10000, {
random_mal <- sample(list_of_males, 78, replace=TRUE)
random_pair <- cbind(list_of_females, random_mal)
random_pair <- data.frame(random_pair)
random_pair$pair <- with(random_pair, paste(list_of_females, random_mal, sep = " "))
typeA <- genome$rel[match(random_pair$pair, genome_year$pair1)]
typeB <- genome$rel[match(random_pair$pair, genome_year$pair2)]
random_pair$relatedness <- ifelse(is.na(typeA), typeB, typeA)
random_pair <- na.omit(random_pair)
random_pair
}, simplify = FALSE)
From there, you can use purrr to add two classification columns and then can use dplyr for the rest. Here is how I did it:
library(tidyverse)
mean_rel <- purrr::map2(.x = mean_rel, .y = seq_along(mean_rel),
function(x, y){
x %>%
mutate(set = paste0("set_", y)) %>%
# do this so the same row of each set can be
# compared
rownames_to_column(var = "row_number")
})
mean_rel_comb <- mean_rel %>%
do.call(rbind, .) %>%
as.tibble() %>%
mutate(relatedness = as.numeric(as.character(relatedness))) %>%
group_by(row_number) %>%
summarize(mean = mean(relatedness))
Using your two datasets combined as a list gave me this:
# A tibble: 3 x 2
row_number mean
<chr> <dbl>
1 1 0.0280
2 2 0.3125
3 3 0.0820
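If every replicate keeps the same rows (i.e. na.omit() never drops anything, so the sets line up row for row), a shorter base R route to the per-row means is a sketch like this:
# One column per set, one row per pairing; rowMeans() then averages each row
# across all 10000 sets. as.numeric(as.character(...)) guards against factors.
rel_mat <- sapply(mean_rel, function(set) as.numeric(as.character(set$relatedness)))
rowMeans(rel_mat)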

Simplest creation of factor variable from dummies

A selected answer to a question here:
creating a factor variable with dplyr?
did not impress Hadley, and the follow-up answer does not generalise well for some of the problems I've come across. I'm wondering if the community can do something better with a simpler example:
### DATA ###
A = round(runif(200,0,1),0)
B = c(1 - A[1:100],rep(0,100))
C = c(rep(0,100), 1 - A[101:200])
dummies <- as.data.frame(cbind(A,B,C))
header <- c("Christian", "Muslim", "Athiest")
names(dummies) <- header
### ONE WAY ###
dummies$Religion <- factor(ifelse(dummies$Christian==1, "Christian",
ifelse(dummies$Muslim==1, "Muslim",
ifelse(dummies$Athiest==1, "Athiest", NA))))
This solution mimics the result provided to the OP in the link above. Is there a simpler function to collapse the dummy variables into one factor variable, like the egen group function in Stata? A simple one-liner would be great.
Using Akrun's solution and system time (thank you):
set.seed(24)
A = round(runif(2e6,0,1),0)
B = c(1 - A[1:1e6],rep(0,1e6))
C = c(rep(0,1e6), 1 - A[1000001:2000000])
dummies <- as.data.frame(cbind(A,B,C))
header <- c("Christian", "Muslim", "Athiest")
names(dummies) <- header
attach(dummies)
#Alistaire
system.time({
dummies %>% rowwise() %>%
transmute(religion = names(.)[as.logical(c(Christian, Muslim, Athiest))])
})
# user system elapsed
# 56.08 0.00 56.08
system.time({
dummies %>% transmute(religion = case_when(
as.logical(Christian) ~ 'Christian',
as.logical(Muslim) ~ 'Muslim',
as.logical(Athiest) ~ 'Atheist'))
})
# user system elapsed
# 0.22 0.04 0.27
#Curt F.
system.time({
dummies %>%
gather(religion, is_valid) %>%
filter(is_valid == T) %>%
select(-is_valid)
})
# user system elapsed
# 0.33 0.03 0.36
#Akrun
system.time({
names(dummies)[as.matrix(dummies)%*% seq_along(dummies)]
})
# user system elapsed
# 0.13 0.06 0.21
system.time({
names(dummies)[max.col(dummies, "first")]
})
# user system elapsed
# 0.04 0.07 0.11
I find that Akrun's solution works out to be the fastest method and provided 2 one-liners. However, many thanks to the others for their unique approaches to the problem and generous supply of coding methods that I would like to learn more about, especially the use of %*%, names(.), is_valid and the qdapTools package.
A quick way with dplyr would be
dummies %>% rowwise() %>%
transmute(religion = names(.)[as.logical(c(Christian, Muslim, Athiest))])
What Hadley's really complaining about in that answer is the nested ifelse structure, though. He's built case_when to replace it:
dummies %>% transmute(religion = case_when(
as.logical(Christian) ~ 'Christian',
as.logical(Muslim) ~ 'Muslim',
as.logical(Athiest) ~ 'Atheist'))
We can use
dummies$Religion <- names(dummies)[as.matrix(dummies)%*% seq_along(dummies)]
Or with max.col
dummies$Religion <- names(dummies)[max.col(dummies, "first")]
If there are rows that have only 0 elements, then
dummies$Religion <- names(dummies)[max.col(dummies, "first")*NA^(!rowSums(dummies))]
NOTE: Any of the above solutions can be wrapped in factor(), but it is better to keep the result as character.
NOTE2: Both solutions are one-line base R solutions and are very fast compared to any package solution (proof is shown in the benchmarks below).
Benchmarks
set.seed(24)
A = round(runif(2e6,0,1),0)
B = c(1 - A[1:1e6],rep(0,1e6))
C = c(rep(0,1e6), 1 - A[1000001:2000000])
dummies <- data.frame(A,B,C)
colnames(dummies) <- c("Christian", "Muslim", "Athiest")
system.time({
dummies %>% rowwise() %>%
transmute(religion = names(.)[as.logical(c(Christian, Muslim, Athiest))])
})
# user system elapsed
# 49.13 0.06 49.55
system.time({
dummies %>% transmute(religion = case_when(
as.logical(Christian) ~ 'Christian',
as.logical(Muslim) ~ 'Muslim',
as.logical(Athiest) ~ 'Atheist'))
})
#Error in mutate_impl(.data, dots) : object 'Christian' not found
#Timing stopped at: 0 0 0
system.time({
names(dummies)[as.matrix(dummies)%*% seq_along(dummies)]
})
# user system elapsed
# 0.11 0.01 0.13
system.time({
names(dummies)[max.col(dummies, "first")]
})
# user system elapsed
# 0.07 0.02 0.08
One way to do this is to combine tidyr and dplyr. This may not give the fastest performance (I haven't checked), but to me at least it gives the easiest-to-understand code.
Start with the dummies data frame from the OP:
A = round(runif(200,0,1),0)
B = c(1 - A[1:100],rep(0,100))
C = c(rep(0,100), 1 - A[101:200])
dummies <- as.data.frame(cbind(A, B, C))
header <- c("Christian", "Muslim", "Atheist")
names(dummies) <- header
Then the gather() function from tidyr does the heavy lifting, and filter() and select() from dplyr do the cleanup.
require(tidyr)
require(dplyr)
dummies %>%
gather(religion, is_valid) %>%
filter(is_valid == T) %>%
select(-is_valid)
The nice thing about this version is that it doesn't make any assumptions about the one-hotness of the initial dataframe. If some row in the initial frame is both an atheist and a Christian, your output will have two rows.
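In current tidyr, gather() has been superseded by pivot_longer(); here is a sketch of the same reshape with the newer function (assuming tidyr >= 1.0):
library(tidyr)
library(dplyr)

dummies %>%
  pivot_longer(everything(), names_to = "religion", values_to = "is_valid") %>%
  filter(is_valid == 1) %>%
  select(-is_valid)
Note that pivot_longer() works row by row, so the filtered output comes back in the original row order rather than stacked column by column as with gather().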
If the main intent of the OP is to create the Religion column this can be done directly in one call:
Religion <- sample(c("Christian", "Muslim", "Atheist"), 200, replace = TRUE,
prob = c(60, 20, 20))
The parameter prob can be used to specify the probability weights. Just to check:
table(Religion)
#Religion
# Atheist Christian Muslim
# 37 115 48
However, if the dummies data.frame would be required for some reason, it could be created from the Religion vector with the following code:
mat <- sapply(unique(Religion), function(x) as.integer(Religion == x))
dummies <- cbind(as.data.frame(mat), Religion)
This will result in:
head(dummies)
# Muslim Christian Atheist Religion
#1 1 0 0 Muslim
#2 1 0 0 Muslim
#3 0 1 0 Christian
#4 1 0 0 Muslim
#5 0 1 0 Christian
#6 0 0 1 Atheist
Note that the result may look different for different runs of sample() as we haven't used set.seed() before calling sample().
From this answer I learned about the mtabulate() function from package qdapTools which can replace the sapply() construct by a one-liner:
dummies <- cbind(qdapTools::mtabulate(Religion), Religion)
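A base R route to the same indicator columns is model.matrix(), shown here as a sketch that assumes the Religion vector created above:
# model.matrix() builds one indicator column per level; the "- 1" drops the
# intercept so every religion gets its own column.
mat <- model.matrix(~ Religion - 1, data = data.frame(Religion))
colnames(mat) <- sub("^Religion", "", colnames(mat))
dummies <- cbind(as.data.frame(mat), Religion)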
