Conditional change of a column in a data frame - r

Apologies in advance if this has already been asked elsewhere, but I've tried different attempts and nothing has worked so far.
I have a data frame Data containing air pollution measurements. The columns "Measuring.Unit" and "Uncertainty.Unit" show that most of the measurements are expressed in "mol/L", but some of them are expressed in "mol/mL".
head(Data)
Locality.Name Chemical Concentration Measuring.Unit Uncertainty Uncertainty.Unit
1 xxxx NH3 0.065 mol/L 0.010 mol/L
2 xxxx CO 0.015 mol/L 0.004 mol/L
3 xxxx CO2 0.056 mol/L 0.006 mol/L
4 xxxx O3 0.67 mol/mL 0.010 mol/mL
5 xxxx H2SO4 0.007 mol/L 0.0008 mol/L
6 xxxx NO 0.89 mol/mL 0.08 mol/mL
Before starting any analysis, I want to convert each value expressed in mol/mL to mol/L using a simple function and, of course, change the associated string "mol/mL" to "mol/L". This should be something like the following (but I guess there are much simpler ways using dplyr or the tidyverse):
# First step
if (Data$Measuring.Unit == "mol/mL") {Data$Concentration <- Data$Concentration * 1000 }
else {Data$Concentration <- Data$Concentration }
if (Data$Uncertainty.Unit == "mol/mL") {Data$Uncertainty <- Data$Uncertainty * 1000 }
else {Data$Uncertainty <- Data$Uncertainty}
# Second step
Data$Measuring.Unit[Data$Measuring.Unit == 'mol/mL'] <- 'mol/L'
Data$Uncertainty.Unit[Data$Uncertainty.Unit == 'mol/mL'] <- 'mol/L'

You can try:
Data$Concentration <- ifelse(Data$Measuring.Unit == "mol/mL",Data$Concentration * 1000,Data$Concentration)
Data$Uncertainty <- ifelse(Data$Uncertainty.Unit == "mol/mL",Data$Uncertainty * 1000,Data$Uncertainty)
This step looks fine:
Data$Measuring.Unit[Data$Measuring.Unit == 'mol/mL'] <- 'mol/L'
Data$Uncertainty.Unit[Data$Uncertainty.Unit == 'mol/mL'] <- 'mol/L'
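if() expects a single TRUE or FALSE, so it cannot test a whole column element by element, whereas ifelse() is vectorized: it evaluates the condition for every element and returns a vector of the same length.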
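As a quick illustration of the difference, here is a minimal sketch on a made-up three-row data frame (the rows are invented for the example and only reuse column names from the question). It converts the mol/mL rows with ifelse() and then relabels the unit:

```r
# Toy data: the second row is expressed in mol/mL
Data <- data.frame(
  Concentration  = c(0.065, 0.67, 0.007),
  Measuring.Unit = c("mol/L", "mol/mL", "mol/L"),
  stringsAsFactors = FALSE
)

# The comparison yields one TRUE/FALSE per row ...
is_ml <- Data$Measuring.Unit == "mol/mL"

# ... and ifelse() picks the converted or the original value row by row
Data$Concentration <- ifelse(is_ml, Data$Concentration * 1000, Data$Concentration)

# Relabel the unit only after the values have been converted
Data$Measuring.Unit[is_ml] <- "mol/L"

print(Data)
```

Computing the logical vector once, before the unit labels are overwritten, keeps the two steps from interfering with each other.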

Related

using lag for creating an x+1 column

I'm trying to implement a lag function, but it seems I need an existing x column for it to work.
Let's say I have this data frame:
df <- data.frame(AgeGroup=c("0-4", "5-39", "40-59","60-69","70+"),
px=c(.99, .97, .95, .96,.94))
I want a column Ix that is lag(Ix)*lag(px), starting from 1000.
The data I want is:
df2 <- data.frame(AgeGroup=c("0-4", "5-39", "40-59","60-69","70+"),
px=c(.99, .97, .95, .96, .94),
Ix=c(1000, 990, 960.3, 912.285, 875.7936))
I've tried
library(dplyr)
df2<-mutate(df,Ix = lag(Ix, default = 1000)*lag(px))
ifelse statements don't work after the creation of a reference
df$Ix2=NA
df[1,3]=1000
df$Ix<-ifelse(df[,3]==1000,1000,
lag(df$Ix, default = 1000)*lag(px,default =1))
I have also been playing around with creating a separate Ix column with Ix=1000 and then running the above, but it doesn't seem to work. Does anyone have any ideas how I can create the x+1 column?
You could use cumprod() combined with dplyr::lag() for this:
> df$Ix <- 1000*dplyr::lag(cumprod(df$px), default = 1)
> df
AgeGroup px Ix
1 0-4 0.99 1000.0000
2 5-39 0.97 990.0000
3 40-59 0.95 960.3000
4 60-69 0.96 912.2850
5 70+ 0.94 875.7936
You can also use accumulate from purrr. Using head(px, -1) includes all values in px except the last one, and the initial Ix is set to 1000.
library(tidyverse)
df %>%
mutate(Ix = accumulate(head(px, -1), prod, .init = 1000))
Output
AgeGroup px Ix
1 0-4 0.99 1000.0000
2 5-39 0.97 990.0000
3 40-59 0.95 960.3000
4 60-69 0.96 912.2850
5 70+ 0.94 875.7936
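Both answers rely on the same observation: unrolling the recursion Ix[k] = Ix[k-1] * px[k-1] gives Ix[k] = 1000 * px[1] * ... * px[k-1], which is a cumulative product shifted by one position. A minimal base R sketch of that identity, without dplyr or purrr, might look like this (Reduce() is used here only as an explicit check of the recursion):

```r
px <- c(.99, .97, .95, .96, .94)

# Shifted cumulative product: prepend 1, drop the last element
Ix_cumprod <- 1000 * c(1, head(cumprod(px), -1))

# Explicit recursion: start at 1000 and multiply by each px except the last
Ix_reduce <- Reduce(`*`, head(px, -1), accumulate = TRUE, init = 1000)

print(Ix_cumprod)
print(Ix_reduce)
```

The two vectors should agree up to floating-point rounding.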

Merging Embedded Lists with Different Date Formats in R

I need to merge two lists with each other but I am not getting what I want and I think it is because the "Date" column is in two different formats. I have a list called li and in this list there are 12 lists each with the following format:
> tail(li$fxe)
Date fxe
3351 2020-06-22 0.0058722768
3352 2020-06-23 0.0044256216
3353 2020-06-24 -0.0044998220
3354 2020-06-25 -0.0027309539
3355 2020-06-26 0.0002832672
3356 2020-06-29 0.0007552346
I am trying to merge each of these unique lists with a different list called factors which looks like :
> tail(factors)
Date Mkt-RF SMB HML RF
3351 20200622 0.0071 0.83 -1.42 0.000
3352 20200623 0.0042 0.15 -0.56 0.000
3353 20200624 -0.0261 -0.52 -1.28 0.000
3354 20200625 0.0112 0.25 0.50 0.000
3355 20200626 -0.0243 0.16 -1.37 0.000
3356 20200629 0.0151 1.25 1.80 0.000
The reason I need this structure is that I am trying to pass them to a function I wrote to run linear regressions. The first line of my function aims to merge these lists, but when I merge them I end up with a null structure even though my lists clearly have the same number of rows. In my function df is li. The embedded list of li is confusing me. Can someone help please?
Function I want to use:
Bf <- function(df, fac){
  # This function calculates the betas of the Fama-French factors
  # using linear regression
  # Input: df = a data frame containing returns of the security
  #        fac = data frame containing excess market return and
  #              the Fama-French 3 factors
  # Output: a vector of betas of the Fama-French model
  temp <- merge(df, fac, by="Date")
  temp <- temp[, !names(temp) %in% "Date"]
  temp[ ,1] <- temp[,1] - temp$RF
  return(lm(temp[,1]~temp[,2]+temp[,3]+temp[,4])$coeff)
}
a: You are dealing with data frames, not lists.
b: If you want to merge them, you need to convert the factors$Date column so that it matches the format of li$fxe$Date.
Try:
factors$Date <- as.Date(as.character(factors$Date), format = "%Y%m%d")
This should convert the Date column of factors to class "Date". Note that the month code is the lowercase %m; the uppercase %M stands for minutes and would give wrong dates.
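To see the conversion and the merge working together, here is a small sketch with two rows per table, using values copied from the printouts in the question. It assumes that the Date column of factors is stored as a number such as 20200626 and that li$fxe$Date is already of class Date; neither is confirmed by the question:

```r
fxe <- data.frame(Date = as.Date(c("2020-06-26", "2020-06-29")),
                  fxe  = c(0.0002832672, 0.0007552346))
factors <- data.frame(Date = c(20200626, 20200629),
                      RF   = c(0, 0))

# Go through character so that "%Y%m%d" can parse the eight digits
factors$Date <- as.Date(as.character(factors$Date), format = "%Y%m%d")

# Both Date columns now have the same class, so the keys can match
merged <- merge(fxe, factors, by = "Date")
print(merged)
```

If the merge still comes back empty, comparing class(li$fxe$Date) with class(factors$Date) is a quick way to check whether the key columns are really comparable.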

R: Aggregating over several variables and observations (depending on values) and creating a new variable

The data set has the following structure
Key Date Mat Amount
<int> <date> <chr> <dbl>
1 1001056 2014-12-12 10025 0.10
2 1001056 2014-12-23 10025 0.20
3 1001056 2015-01-08 10025 0.10
4 1001056 2015-04-07 10025 0.20
5 1001056 2015-05-08 10025 0.20
6 1001076 2013-10-29 10026 3.00
7 1001140 2013-01-18 10026 0.72
8 1001140 2013-04-11 10026 2.40
9 1001140 2014-10-08 10026 0.24
10 1001237 2015-02-17 10025 2.40
11 1001237 2015-02-17 10026 3.40
Mat takes values in {10001,...,11000}, hence A:=|Mat|=1000.
I would like to accomplish the following goals:
1) (Intermediate step) For each Key-Date combination, I would like to calculate, for all materials that are available at that combination (which might vary from key to key), the differences in amount,
e.g. for the combination "1001237 2015-02-17" this would be, for materials 10025 and 10026, 2.40-3.40=-1 (but there might be more combinations). (How can those values be stored efficiently?)
This step might be skipped.
2) Finally, I would like to construct a new A x A matrix (A=1000) where each entry (i,j) (material combination i and j) contains the average of the values calculated in the previous step.
More formally, entry (i,j) is given by
1/|all Key-Date combinations containing Mat i and Mat j| \sum_{all Key-Date combinations containing Mat i and Mat j} (Amount_i - Amount_j)
As the table is quite large efficiency of the computation is very important.
Thank you very much for your help in advance!
I can do it with list columns in tidyverse; the trick is to use group_by to get distinct combinations of Key and Date. Here's the code:
library(tidyverse)
materials <- unique(x$Mat)
n <- length(materials)
x <- x %>%
  group_by(Key, Date) %>%
  nest() %>%
  # Create a n by n matrix for each combination of Key and Date
  mutate(matrices = lapply(data,
    function(y) {
      out <- matrix(nrow = n, ncol = n,
                    dimnames = list(materials, materials))
      # Only fill in when the pair of materials is present
      # for the date of interest
      mat_present <- as.character(unique(y$Mat))
      for (i in mat_present) {
        for (j in mat_present) {
          # You may want to take an absolute value
          out[i,j] <- y$Amount[y$Mat == i] - y$Amount[y$Mat == j]
        }
      }
      out
    }))
If you really want speed, you can implement the function in lapply with Rcpp. You can use RcppParallel to further speed it up. Now one of the columns of the data frame is a list of matrices. Then, for each element of the matrices, take an average while ignoring NAs:
x_arr <- array(unlist(x$matrices), dim = c(n, n, nrow(x)))
results <- apply(x_arr, 2, rowMeans, na.rm = TRUE)
I stacked the list of matrices into a 3D array and found row means slice by slice. For performance, you can also do it in RcppArmadillo, with sum(x_arr, 2), but it's hard to deal with missing values when not all types of materials are represented in a combination of Key and Date.
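A different route, not the approach used above, is to skip the per-group matrices and self-join the table on Key and Date, which yields every pair of materials that occur together. The sketch below uses base R only and three rows taken from the sample data. Whether it scales to the real table is untested, since the self-join can grow large when many materials share a Key-Date combination:

```r
x <- data.frame(
  Key    = c(1001140, 1001237, 1001237),
  Date   = as.Date(c("2014-10-08", "2015-02-17", "2015-02-17")),
  Mat    = c("10026", "10025", "10026"),
  Amount = c(0.24, 2.40, 3.40),
  stringsAsFactors = FALSE
)

# Self-join: one row per ordered pair of materials within a Key-Date group
# (merge() adds the suffixes .x and .y to the non-key columns)
pairs <- merge(x, x, by = c("Key", "Date"))
pairs$diff <- pairs$Amount.x - pairs$Amount.y

# Average difference for every (Mat i, Mat j) pair; pairs that never
# occur together would come out as NA
res <- tapply(pairs$diff, list(pairs$Mat.x, pairs$Mat.y), mean)
print(res)
```

The result is indexed by material name, so res["10025", "10026"] holds the average of Amount_10025 - Amount_10026 over the shared Key-Date combinations.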

R-How do I apply a for loop across multiple data frames?

In R, I have 3 dataframes that are similar to the sample versions I have provided below. The first Data is the primary data set, the TW and UW dataframes have a similar variable as Data (MN-mapping_for_N), and then 1000 different values for each variable (N48) etc. I provided 3 for my purposes here.
Data<-matrix(c(4720,44.29,"Work or Private Clinic","N48",2659,55.05,"Hospital","N1",1612,59.99,"No Care","N48"),ncol = 4,byrow=TRUE)
colnames(Data)<-c("studyid", "Pred_ex", "wherecare", "MN-mapping_for_N")
Data<-data.frame(Data)
TW<-matrix(c("N48",0.07,0.08,0.09,"N1",0.10,0.11,0.12,"N2",0.02,0.03,0.04,"N3",0.04,0.05,0.06),ncol = 4, byrow = TRUE)
colnames(TW)<-c("MN-mapping_for_N","draw1","draw2","draw3")
TW<-data.frame(TW)
UW<-matrix(c("N48",0.71,0.81,0.91,"N1",0.11,0.111,0.131,"N2",0.021,0.031,0.041,"N3",0.041,0.051,0.061),ncol = 4, byrow = TRUE)
colnames(UW)<-c("MN-mapping_for_N","draw1","draw2","draw3")
UW<-data.frame(UW)
My goal is to create a new column with values from a randomly selected column of the UW and TW data; which of the two to draw from depends on the value in Data$wherecare.
I have been using a mix of dplyr and the match function combined with a couple of functions of my own creation. Currently this looks like
drawselect<-function(x) {
samplepick<-sample(2:1001,1)
select(x,1,num_range("draw",samplepick))
}
DALY_FX_LT_NR<-function(x){
draw_T_DW<-drawselect(TW)
draw_UT_DW<-drawselect(UW)
drawnames.TW<-colnames((draw_T_DW))
drawnames.UT<-colnames(draw_UT_DW)
UT.draw<-drawnames.UT[2]
T.draw<-drawnames.T[2]
print(UT.draw)
print(T.draw)
newdf<-x %>% mutate(DW=NA)
for(i in 1:nrow(newdf)){
if(newdf$wherecare[i]!= "No Care"){
newdf$DW=draw_T_DW[,2][match(newdf$`MN-mapping_for_N`,draw_T_DW$`MN-mapping_for_N`)]
next
}else if(newdf$wherecare[i]=="No Care"){
newdf$DW=draw_UT_DW[,2][match(newdf$`MN-mapping_for_N`,draw_UT_DW$`MN-mapping_for_N`[i])]
}
}
newdf
}
The code runs, but I can't seem to get it to actually iterate row by row to make it pull from the correct dataframe for draw values (i.e. UT or TW after going through the drawselect function).
So what I get looks like:
-------------------------------------------------------------
studyid Pred_ex wherecare MN-mapping_for_N DW
--------- --------- ---------------------- ------------------ ------
4720 44.29 Work or Private Clinic N48 0.08
2659 55.05 Hospital N1 0.11
1612 59.99 No Care N48 0.08
--------------------------------------------------------------------
When I should be getting:
studyid Pred_ex wherecare MN-mapping_for_N DW
--------- --------- ---------------------- ------------------ ------
4720 44.29 Work or Private Clinic N48 0.08
2659 55.05 Hospital N1 0.11
1612 59.99 No Care N48 0.81
--------------------------------------------------------------------
The key difference being the 0.81 in the lower right corner, not a big deal in the sample data, but the actual data is several hundred rows long, so I would like to have the function "decide correctly" which dataset to pull from. This value could be 0.71,0.81 or 0.91, any of the UT values for N48 will work.
The ultimate goal will be to use that value in a calculation by multiplying by the Pred_ex column, which I can do, then rerun this function many times to bootstrap the data, but until I can get these if statements to work correctly I can't do that. I have also tried using dplyr::left_join to match these and had similar problems with the conditional statements not working. I think the match function as its written will work better, but I'm certainly open to anything.
Any help is greatly appreciated.
Also, thanks to everyone on stack overflow in general, reading your answers to other questions is the main reason I have gotten this far.
You don't need a new function (I kept drawselect); you can just do the following:
for (i in 1:nrow(Data)){
  if (Data$wherecare[i] != "No Care"){
    Data$DW[i]<- drawselect(TW)[which(drawselect(TW)$MN.mapping_for_N == as.character(Data$MN.mapping_for_N[i])), 2]
  } else {
    Data$DW[i]<- drawselect(UW)[which(drawselect(UW)$MN.mapping_for_N == as.character(Data$MN.mapping_for_N[i])), 2]
  }
}
> Data
studyid Pred_ex wherecare MN.mapping_for_N DW
1 4720 44.29 Work or Private Clinic N48 0.08
2 2659 55.05 Hospital N1 0.11
3 1612 59.99 No Care N48 0.81
If you want to wrap everything in a function (including drawselect), try something along the lines of the following:
DALY_FX_LT_NR<-function(x, y, z){ #x would be Data, y would be TW, z would be UW
  # pick one draw column at random (draw1 ... drawK, with K = ncol(y) - 1)
  samplepick<-sample(seq_len(ncol(y)-1),1)
  for (i in 1:nrow(x)){
    if (x$wherecare[i] != "No Care"){
      x$DW[i]<- y[which(y$MN.mapping_for_N==as.character(x$MN.mapping_for_N[i])), paste0("draw", samplepick)]
    } else {
      x$DW[i]<- z[which(z$MN.mapping_for_N==as.character(x$MN.mapping_for_N[i])), paste0("draw", samplepick)]
    }
  }
  return(x)
}
> DALY_FX_LT_NR(x = Data, y = TW, z = UW)
studyid Pred_ex wherecare MN.mapping_for_N DW
1 4720 44.29 Work or Private Clinic N48 0.09
2 2659 55.05 Hospital N1 0.12
3 1612 59.99 No Care N48 0.91
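Since match() and ifelse() are both vectorized, the row loop can in principle be replaced by two lookups and one selection. The following is only a sketch under assumptions: pick_dw is a hypothetical helper name, the data frames are cut-down copies of the sample data with numeric draw columns, and the draw column is passed in by name rather than sampled:

```r
Data <- data.frame(
  studyid          = c(4720, 2659, 1612),
  wherecare        = c("Work or Private Clinic", "Hospital", "No Care"),
  MN.mapping_for_N = c("N48", "N1", "N48"),
  stringsAsFactors = FALSE
)
TW <- data.frame(MN.mapping_for_N = c("N48", "N1"),
                 draw1 = c(0.07, 0.10), draw2 = c(0.08, 0.11),
                 stringsAsFactors = FALSE)
UW <- data.frame(MN.mapping_for_N = c("N48", "N1"),
                 draw1 = c(0.71, 0.11), draw2 = c(0.81, 0.111),
                 stringsAsFactors = FALSE)

# Hypothetical helper: look up both candidate values, then choose per row
pick_dw <- function(x, tw, uw, draw) {
  from_tw <- tw[[draw]][match(x$MN.mapping_for_N, tw$MN.mapping_for_N)]
  from_uw <- uw[[draw]][match(x$MN.mapping_for_N, uw$MN.mapping_for_N)]
  ifelse(x$wherecare == "No Care", from_uw, from_tw)
}

Data$DW <- pick_dw(Data, TW, UW, draw = "draw2")
print(Data)
```

For bootstrapping, the draw name could be sampled once per replicate, for example with sample(c("draw1", "draw2"), 1), and passed to the helper.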

If statement for weighted averaged in R

I have a data file that is several million lines long, and contains information from many groups. Below is an abbreviated section:
MARKER GROUP1_A1 GROUP1_A2 GROUP1_FREQ GROUP1_N GROUP2_A1 GROUP2_A2 GROUP2_FREQ GROUP2_N
rs10 A C 0.055 1232 A C 0.055 3221
rs1000 A G 0.208 1232 A G 0.208 3221
rs10000 G C 0.134 1232 C G 0.8624 3221
rs10001 C A 0.229 1232 A C 0.775 3221
I would like to create a weighted average of the frequency (FREQ) variable (which in itself is straightforward); however, in this case some of the rows are mismatched (rows 3 & 4). If the letters do not line up, then the frequency of the second group needs to be subtracted by 1 before the weighted mean of that marker is calculated.
I would like to set up a simple IF statement, but I am unsure of the syntax of such a task.
Any insight or direction is appreciated!
Say you've read your data in a data frame called mydata. Then do the following:
mydata$GROUP2_FREQ <- mydata$GROUP2_FREQ - (mydata$GROUP1_A1 != mydata$GROUP2_A1)
It works because R treats TRUE values as 1 and FALSE values as 0.
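EDIT: The line above gives FREQ - 1, which is negative, for mismatched rows. Try the following instead, which returns 1 - FREQ for mismatched rows and leaves matching rows unchanged: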
mydata$GROUP2_FREQ <- abs( (as.character(mydata$GROUP1_A1) !=
as.character(mydata$GROUP2_A1)) -
as.numeric(mydata$GROUP2_FREQ) )
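Putting the pieces together, a sketch of the full calculation on two rows from the sample might look as follows. Weighting by GROUP1_N and GROUP2_N is an assumption about what "weighted average" means here, and the column names GROUP2_FREQ_ALIGNED and WMEAN are made up for the example:

```r
mydata <- data.frame(
  MARKER      = c("rs10", "rs10000"),
  GROUP1_A1   = c("A", "G"),
  GROUP1_FREQ = c(0.055, 0.134),
  GROUP1_N    = c(1232, 1232),
  GROUP2_A1   = c("A", "C"),
  GROUP2_FREQ = c(0.055, 0.8624),
  GROUP2_N    = c(3221, 3221),
  stringsAsFactors = FALSE
)

# TRUE where the alleles of the two groups do not line up
mismatch <- mydata$GROUP1_A1 != mydata$GROUP2_A1

# Flip the frequency (1 - FREQ) only for mismatched rows
mydata$GROUP2_FREQ_ALIGNED <- ifelse(mismatch,
                                     1 - mydata$GROUP2_FREQ,
                                     mydata$GROUP2_FREQ)

# Sample-size-weighted mean of the two aligned frequencies (assumed weighting)
mydata$WMEAN <- (mydata$GROUP1_FREQ * mydata$GROUP1_N +
                 mydata$GROUP2_FREQ_ALIGNED * mydata$GROUP2_N) /
                (mydata$GROUP1_N + mydata$GROUP2_N)

print(mydata)
```

Because every step is vectorized, no explicit if statement or loop is needed, which matters for a file with several million rows.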
