I have a large dataframe that contains many columns, but the relevant ones are: ID (the number assigned to each subject), Time (the time at which that subject's measurement was taken) and Concentration.
A very simplified example would be:
df <- data.frame( ID=c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3),
Concentration=c("XXX",0.3,0.7,0.6,"XXX","XXX",0.8,0.3,"XXX","XXX",
"XXX",0.6,0.1,0.1,"XXX"),
Time=c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5))
I would like to replace only the "XXX" values in column Concentration based on the following conditions:
when the value in column Time is less than or equal to 3, "XXX" should be replaced with 0
when the value in column Time is greater than 3, "XXX" should be replaced with the word "Missing",
unless two consecutive "XXX" values appear for a single subject (ID) with Time > 3, in which case the first
consecutive "XXX" should be replaced with 0.05 and the second consecutive "XXX" (or all following "XXX" values if there are more) should be replaced with the word "Missing".
I have tried mutate_at and replace_na, some ifelse statements, and case_when, but I just cannot seem to figure out how to do it correctly. Any help would be greatly appreciated!
Edit: Just to show some work:
df[df == "XXX" & df$Time<3] <- as.numeric(0)
df[df == "BLQ" & df$Time>3] <- as.character("Missing")
I have managed to find a simple and robust solution that takes care of the first two parts of my problem; what I'm stuck on is the last part - when there are two or more consecutive "XXX" values for a single subject with Time > 3. I imagine I should loop an ifelse statement over an indexed list of the IDs or something like that, but I can't figure out how to do that.
It's very important that the IDs are kept separate here, because there could be an "XXX" as the final Concentration of one ID and another as the first Concentration of the next ID, and I do not want that to be read as two consecutive "XXX" values for a single ID.
I solved it using some tidyverse functions, and I also added some extra records to your example.
rm(list = ls(all=TRUE))
require(tidyverse)
df <- data.frame( ID=c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,3,3,3),
Concentration=c("XXX",0.3,0.7,0.6,"XXX","XXX",0.8,0.3,"XXX","XXX",
"XXX",0.6,0.1,0.1,"XXX",0.2,"XXX","XXX",1),
Time=c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,6,7,8,9))
df <- as_tibble(df) %>%
  mutate(Concentration = as.character(Concentration),
         Concentration_Original = Concentration) %>%
  mutate(Concentration = ifelse(Concentration == 'XXX' & Time <= 3, "0", Concentration)) %>%
  group_by(ID) %>%
  mutate(Concentration = ifelse(Concentration == 'XXX' & Concentration == lead(Concentration),
                                "0.05",
                                ifelse(Concentration == 'XXX', "Missing", Concentration))) %>%
  replace_na(list(Concentration = "Missing")) %>%
  ungroup()
I just had to figure this out a few minutes ago and found this question while looking for a better version. Here's mine:
value_swap <- function(dataset, specified_columns, original_val, new_val) {
  temp. <- dataset
  temp.[, specified_columns][temp.[, specified_columns] == original_val] <- new_val
  return(temp.)
}
value_swap(mtcars, c("cyl","gear"), 4, 3.99)
You'll notice the 4s in the cyl and gear columns of mtcars are now 3.99, but the carb column is left alone.
As for your conditionals: you can just subset your dataset into separate ones based on your conditions, run the custom value_swap on each subset, then rbind them back together. Much simpler than a giant nested ifelse, in my opinion.
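For example, something along these lines (just a sketch covering the two simpler conditions with the df from the question; the consecutive-"XXX" rule for Time > 3 would still need its own handling first, and Concentration is assumed to be a character column, as it is by default in R >= 4.0):
early <- subset(df, Time <= 3)   # "XXX" here should become 0
late  <- subset(df, Time > 3)    # "XXX" here should become "Missing"
early <- value_swap(early, "Concentration", "XXX", "0")
late  <- value_swap(late, "Concentration", "XXX", "Missing")
df_new <- rbind(early, late)
df_new <- df_new[order(df_new$ID, df_new$Time), ]  # restore the original row order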
I am trying to streamline the process of auditing chemistry laboratory data. When we encounter data where an analyte is not detected, I need to change the recorded result to a value equal to 1/2 of the level of detection (LOD) for the analytical method. I have the LODs contained within another dataframe to be used as a lookup table.
I have multiple columns representing data from different analytical tests, each with its own unique LOD. Here's an example of the type of data I am working with:
library(tidyverse)
dat <- tibble("Lab_ID" = as.character(seq(1,10,1)),
"Tributary" = c('sawmill','paint', 'herring', 'water',
'paint', 'sawmill', 'bolt', 'water',
'herring', 'sawmill'),
"date" = rep(as.POSIXct("2021-10-01 12:00:00"), 10),
"TP" = c(1.5,15.7,-2.3,7.6,0.1,45.6,12.2,-0.1,22.2,0.6),
"TN" = c(100.3,56.2,-10.5,0.4,-0.3,11.0,45.8,256.0,12.2,144.0),
"DOC" = c(56.0,120.3,-10.5,0.2,14.6,489.3,0.3,14.4,54.6,88.8))
dat
detect_level <- tibble("Parameter" = c('TP', 'TN', 'DOC'),
'LOD' = c(0.6, 11, 0.3)) %>%
mutate(halfLOD=LOD/2)
detect_level
I have pored over multiple other questions with a similar theme:
Change values in multiple columns of a dataframe using a lookup table
R - Match values from multiple columns in a data.frame to a lookup table.
Replace values in multiple columns using different thresholds
and gotten to a point where I have pivoted the data and split it into a list of dataframes, one per analyte:
dat %>%
pivot_longer(cols = c('TP','TN','DOC')) %>%
arrange(name) %>%
split(.$name)
I have tried to apply a function using map(); however, I cannot figure out how to integrate the values from the lookup table (detect_level) into my code. If someone could help me continue this pipe, or finish the process to achieve a final product dat2 that looks like this, I would appreciate it:
dat2 <- tibble("Lab_ID" = as.character(seq(1,10,1)),
"Tributary" = c('sawmill','paint', 'herring', 'water',
'paint', 'sawmill', 'bolt', 'water',
'herring', 'sawmill'),
"date" = rep(as.POSIXct("2021-10-01 12:00:00"), 10),
"TP" = c(1.5,15.7,0.3,7.6,0.3,45.6,12.2,0.3,22.2,0.6),
"TN" = c(100.3,56.2,5.5,5.5,5.5,11.0,45.8,256.0,12.2,144.0),
"DOC" = c(56.0,120.3,0.15,0.15,14.6,489.3,0.3,14.4,54.6,88.8))
dat2
Another possibility comes from the closest similar question I have found:
Lookup multiple column from a single table
Here's a snippet of code that I have adapted from that question; however, if you run it you will see that wherever a value is not found in detect_level, an NA is returned. Additionally, it does not appear to have worked for $TN or $DOC, even in cases where the $LOD value from detect_level was present.
dat %>%
mutate(across(all_of(unique(detect_level$Parameter)),
~ {i1 <- detect_level$Parameter == cur_column()
detect_level$LOD[i1][match(., detect_level$LOD)]}))
I am not at all comfortable with the purrr syntax here and have only adapted this code from the linked question, so if this is the direction an answerer chooses, I would appreciate it if they commented the code to explain briefly what is happening "under the hood".
Thank you in advance!
Perhaps this helps
library(dplyr)
dat %>%
  mutate(across(all_of(detect_level$Parameter),
                # for every analyte column, raise any value below its LOD up to that LOD
                ~ pmax(., detect_level$LOD[match(cur_column(), detect_level$Parameter)])))
For the updated case (replacing values below the LOD with half the LOD):
dat %>%
  mutate(across(all_of(detect_level$Parameter),
                # cur_column() gives the name of the column currently being processed ("TP", "TN" or "DOC");
                # match() looks that name up in detect_level to pull the matching LOD and halfLOD;
                # replace() then swaps every value below the LOD for half the LOD
                ~ replace(., . < detect_level$LOD[match(cur_column(), detect_level$Parameter)],
                          detect_level$halfLOD[match(cur_column(), detect_level$Parameter)])))
I have a dataframe from which I need to remove duplicates based on the variable "e-mail". However, there are a lot of NAs there that I cannot get rid of because they are valuable observations. Besides the NAs, some people happened to put just a dot in that field, so I want to know if I can get rid of the rows with duplicated e-mails while ignoring the NAs and the observations with "." as the e-mail.
I've tried distinct() and n_distinct(), but neither of these has an na.rm option.
Here's an example of what I mean:
library(dplyr)
email <- c("xxx#xxx.xxx","xxx#xxx.xxx","yyy#yyy.yyy","yyy#yyy.yyy","zzz#zzz.zzz","zzz#zzz.zzz",".",".",".",".",".")
names <- c("Gabriel","Marcos","Julio","Rafael","Victor","Azymov","Turkey Sandvich","Marzia","Door","Cato","Doggo")
test <- data.frame(email,names)
morenames <- c("Soap","Redbull","World of Warcraft")
moreemails <- c(NA,NA,NA)
test2 <- data.frame(moreemails, morenames)
names(test2) <- c("email","names")
test <- test %>% rbind(test2)
test
verif_dup <- test[duplicated(test[,1]),]
verif_dup
I can see all the duplicated e-mails in verif_dup. I want a way to remove the duplicates like xxx#xxx.xxx, yyy#yyy.yyy and zzz#zzz.zzz, but keep the "." and NA rows.
Unfortunately I could not think of an elegant way to do this in one step. I split the operations into one data frame for isolating the NA and ".", and another for the duplicates.
test_missing <- test %>%
filter(is.na(email) | email == '.')
# here I assume you keep the first row of the duplicated values (the 'names' column is different for each)
test_duprm <- test %>%
filter(!is.na(email) & email != '.') %>%
group_by(email) %>%
filter(row_number() == 1)
bind_rows(test_missing, test_duprm)
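For what it's worth, a one-step variant that I believe gives the same result (keeping every NA and "." row, and only the first row of each real address) would be:
test %>%
  group_by(email) %>%
  # keep all NA and "." rows, but only the first row of every real e-mail address
  filter(is.na(email) | email == "." | row_number() == 1) %>%
  ungroup()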
I currently have a dataset with more than 186k observations (rows), shown in figure 1. Each row belongs to a company identified by the BVDID column, and each company should contain data for all years from 2013 to 2017.
missingdata <- series %>% filter(LIABILITIES == 0) %>% select(BVDID)
However, using the code above I found 87k rows with only zero values in the missingdata object.
How do I delete the rows of the series object whose BVDID (company code) appears in the missingdata dataframe? Also, is there a way to make the years look better under str(series) and sort them in ascending order within each company code?
Best regards
There are many ways; here is one.
Use the tidyverse anti_join function, which works like the set operation A - B and therefore removes from the first data frame all rows that match the second.
series %>% anti_join(missingdata, by =c("BVDID" = "BVDID"))
Or directly: LIABILITIES == 0 returns logical values, putting + in front converts them to 0 or 1, and any group whose sum of these values is greater than 0 contains rows to be removed, so we keep only the groups where that sum is 0.
series %>% group_by(BVDID) %>% filter(sum(+(LIABILITIES == 0)) == 0) %>% ungroup()
series %>%
# filter out the BVDIDs from missingdata
filter(!BVDID %in% pull(missingdata)) %>%
# order the df
arrange(BVDID, year)
Here I have the code with a for loop:
for (i in 1:length(mc_1$code)) {
  cmc1 <- mc_1$code[i]
  cmc2 <- mc_1[mc_1$code == cmc1, ]
  cmc3 <- cmc2[order(cmc2[, 2], cmc2[, 3]), ]             # order by year, then month
  mc_1[mc_1$code == cmc1, ]$region <- last(cmc3$region)   # last() is from dplyr
}
For each value of the variable "code", mc_1 has a different number of rows. mc_1 also has year and month columns (columns 2 and 3), and another column, say, region. "region" can differ even for the same "code" in different months and years.
For each "code", I want to select only the most recent region by year and month (that's why I use "order") and assign that region to all rows for that code.
This for loop works, but for efficiency and code-length reasons, how can I rewrite it better using something like data.table or dplyr?
You can try this using the dplyr package and the fact that n() returns the number of rows in each group:
mc_1 %>%
  group_by(code) %>%
  arrange(year, month) %>%
  mutate(region = region[n()])
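If you prefer data.table, a rough equivalent (a sketch assuming the columns are literally named code, year, month and region) could be:
library(data.table)
setDT(mc_1)                               # convert to a data.table by reference
setorder(mc_1, code, year, month)         # sort so the last row per code is the most recent
mc_1[, region := region[.N], by = code]   # assign that most recent region to every row of the code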
hope it helps!!
Since I have to read over 3 GB of data, I would like to improve my code by replacing the for loops and if statement with the apply function.
Below is a reproducible example of my code. The overall purpose (in this example) is to count the number of positive and negative values in the "c" column for each combination of the "a" and "b" columns. In my real case I have over 150 files to read.
# Example of initial data set
df1 <- data.frame(a=rep(c(1:5),times=3),b=rep(c(1:3),each=5),c=rnorm(15))
# Another dataframe to keep track of "c" counts
dfOcc <- data.frame(a=rep(c(1:5),times=3), b=rep(c(1:3),each=5), positive=0, negative=0)
So far I have written this code, which works but is really slow:
for (i in 1:nrow(df1)) {
  x <- df1[i, "a"]
  y <- df1[i, "b"]
  if (df1[i, "c"] >= 0) {
    dfOcc[which(dfOcc$a == x & dfOcc$b == y), "positive"] <- dfOcc[which(dfOcc$a == x & dfOcc$b == y), "positive"] + 1
  } else {
    dfOcc[which(dfOcc$a == x & dfOcc$b == y), "negative"] <- dfOcc[which(dfOcc$a == x & dfOcc$b == y), "negative"] + 1
  }
}
I am unsure whether the code is slow due to the size of the files (260k rows each) or due to the for-loop?
So far I managed to improve it in this way:
dfOcc[which(dfOcc$a == df1$a & dfOcc$b == df1$b), "positive"] <- apply(df1, 1, function(x) ifelse(x["c"] > 0, 1, 0))
This works fine in this example but not in my real case:
It only keeps count of the positive c values, and running this code twice might be counterproductive.
My original datasets are 260k rows while my "tracer" is 10k rows (the initial dataset repeats the a and b values with other c values).
Any tip on how to improve those two points would be greatly appreciated!
I think you can simply count and spread the data. This will be easier and will work on any group and dataset. You can change group_by(a) to group_by(a, b) if you want to count grouping by both the a and b columns (see the variation after the code below).
library(dplyr)
library(tidyr)
df1 %>%
  group_by(a) %>%
  mutate(sign = ifelse(c > 0, "Positive", "Negative")) %>%
  count(sign) %>%
  spread(sign, n)
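If you want the counts for every a/b combination, with explicit zeros where one of the signs never occurs, a small variation of the same idea (just a sketch) would be:
df1 %>%
  mutate(sign = ifelse(c > 0, "Positive", "Negative")) %>%  # label every value
  count(a, b, sign) %>%                                     # count per a/b/sign combination
  spread(sign, n, fill = 0)                                 # one column per sign, 0 where a sign is absent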
The data.table package might help you do this in one line.
library(data.table)
df1 <- data.table(a = rep(1:5, times = 3), b = rep(1:3, each = 5), c = rnorm(15))
posneg <- c("positive", "negative")  # names of the new columns needed
df1[, (posneg) := list(ifelse(c > 0, 1, 0), ifelse(c < 0, 1, 0))]  # use list to combine the 2 ifelse conditions
For more information, try
?data.table
If you really want the positive and negative columns to be in a separate dataframe:
dfOcc <- df1[,c("a", "positive","negative")]