I will start by saying that I am fully aware that similar questions have been answered before, but after hours of reading and troubleshooting, I believe I have a unique issue. Apologies if I have missed something. The answer given in the much up-voted similar question points to NAs in the data, but as explained in my question, I do not seem to have any nor do I know where they may be popping up.
I am running a for-loop in R 4.1.2 using the lubridate, readr, and dplyr packages that marks as invalid any data recorded by individuals before they have passed a reliability test. Tests are specific to groups, so an individual may be reliable for one group, many, all, or none. The function I've written takes a dataframe "x" and, for each individual observer, checks each data point against a dataframe "key" that has columns for the observer (observer), the test pass date (begin_valid), and the group they are now valid for (group_valid). The key may have multiple rows per observer if they have passed multiple tests. I've used tools from the lubridate package to create POSIXct values for the dates so that they can be compared to each other. The user can set y = "remove" to remove invalid data, or set y to anything else to label and keep invalid data. Here is the code:
invalidata <- function(x, y){
  library(lubridate)
  library(readr)
  library(dplyr)
  x$valid <- rep(1, length(rownames(x)))
  alts <- 0
  key <- read_csv("updatable csv file")
  key$begin_valid <- parse_date_time(key$begin_valid, c("mdy", "dmy", "ydm", "mdy"), tz= "Africa/Lubumbashi")
  for(i in unique(x$observer)){
    subkey <- subset(key, key$observer == i)
    subx <- subset(x, x$observer == i)
    if(is.na(subkey$begin_valid) == TRUE || is.na(subkey$group_valid) == TRUE){ #if reliable for nothing, remove
      x[x$observer == i]$valid <- 0
      print("removed completely unreliable")
    }else{
      for(j in rownames(subx)){
        if(subx$group[j] %in% subkey$group_valid == FALSE && "All" %in% subkey$group_valid == FALSE){ #if not reliable for specific group or all groups, remove
          x$valid[j] <- 0
          print("removed unreliable for group")
        }
        if(subx$group[j] %in% subkey$group_valid){ #remove if before reliability date for group
          if(subx$date[j] < subset(subkey, subkey$group_valid == subx$group[j])$begin_valid){
            x$valid[j] <- 0
            print("removed pre-reliability")
          }
        } else{ #remove if not reliable for specific group, and before reliability date for all
          if(subx$date[j] < subset(subkey, subkey$group_valid == "All")$begin_valid){
            x$valid[j] <- 0
            print("removed pre-reliability")
          }
        }
      }
    }
  }
  if(y == "remove"){ #remove all invalid data and validity column
    x <- subset(x, x$valid == 1)
    x <- select(x, -valid)
  }
  return(x)
}
My issue is with the line
if(subx$date[j] < subset(subkey, subkey$group_valid == "All")$begin_valid)
which returns the error:
Error in if (subx$date[j] < subset(subkey, subkey$group_valid == "All")$begin_valid) { :
missing value where TRUE/FALSE needed
However, when I run the code inside the parentheses
subx$date[j] < subset(subkey, subkey$group_valid == "All")$begin_valid
outside of the context of the loop, I receive either a TRUE or FALSE value as relevant. I've checked all dates for any NULL or NA values, as well as addressed any data with NAs in a previous step of the code:
if(is.na(subkey$begin_valid) == TRUE || is.na(subkey$group_valid) == TRUE){}
else{ #code at issue }
I am not having issues with this very similar line:
if(subx$date[j] < subset(subkey, subkey$group_valid == subx$group[j])$begin_valid){
My best guess is that something may be going wrong with the date formatting? I know that this error is usually a symptom of NULLs or NAs floating in the data, but for the life of me I cannot figure out where they could be coming from. Dates in "x" have already been parsed and contain no NAs or NULLs. I have not included the data as it is proprietary, but I can come up with mock data if people are interested/think it would be necessary. Thank you in advance for reading through and for any thoughts/troubleshooting suggestions!
MRE:
dput output for x:
structure(list(date = structure(c(1486764000, 1486764000, 1486850400,
1486936800, 1487023200, 1487109600, 1487109600, 1487196000, 1487196000,
1487368800, 1487368800, 1487368800, 1487368800, 1487368800, 1487368800,
1487455200, 1487455200, 1487455200, 1487541600, 1487887200), class = c("POSIXct",
"POSIXt"), tzone = "Africa/Lubumbashi"), time = structure(c(23734,
53419, 41352, 33034, 24220, 34812, 35624, 27949, 27950, 49192,
49286, 49392, 49401, 62719, 62725, 26046, 26047, 27246, 46611,
61228), class = c("hms", "difftime"), units = "secs"), observer = c("MA",
"LE", "VI", "VI", "MI", "MA", "MA", "ME", "VI", "BA", "MA", "BA",
"MA", "ME", "MI", "MA", "BA", "MI", "BA", "MA"), group = c("EKK",
"EKK", "KKL", "EKK", "KKL", "KKL", "KKL", "EKK", "EKK", "EKK",
"EKK", "EKK", "EKK", "KKL", "KKL", "EKK", "EKK", "KKL", "EKK",
"KKL")), row.names = c(NA, -20L), spec = structure(list(cols = list(
date = structure(list(), class = c("collector_character",
"collector")), time = structure(list(format = ""), class = c("collector_time",
"collector")), observer = structure(list(), class = c("collector_character",
"collector")), group = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), delim = ","), class = "col_spec"), problems = <pointer: 0x000001f6f2f7af70>, class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"))
for the key:
structure(list(observer = c("BA", "MI", "VI", "ME", "DA", "OK",
"FR", "MA", "LA", "DE", "JD", "JD", "JD", "BR", "DA", "DA", "PA",
"PA", "JA", "JE", "DI", "JP", "LE", "MR", "NG", "TR", "TE"),
begin_valid = c("8/12/2016", "12/21/2019", "8/11/2016", "8/11/2016",
"12/11/2019", "12/17/2019", "12/11/2019", "11/2/2016", "1/11/2020",
"12/12/2019", "12/16/2019", "12/16/2019", "11/22/2020", "6/19/2021",
"11/26/2020", "11/26/2020", "7/25/2021", "7/25/2021", NA,
NA, NA, NA, NA, NA, NA, NA, NA), group_valid = c("All", "All",
"All", "All", "All", "All", "FKK", "All", "FKK", "FKK", "EKK",
"KKL", "All", "EKK", "EKK", "KKL", "EKK", "KKL", NA, NA,
NA, NA, NA, NA, NA, NA, NA), subgroup = c(NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, "S", NA, NA, NA, "S", NA, "N",
NA, NA, NA, NA, NA, NA, NA, NA, NA)), row.names = c(NA, -27L
), spec = structure(list(cols = list(observer = structure(list(), class = c("collector_character",
"collector")), begin_valid = structure(list(), class = c("collector_character",
"collector")), group_valid = structure(list(), class = c("collector_character",
"collector")), subgroup = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), delim = ","), class = "col_spec"), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"))
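Two errors in this code:
1. Because rownames(.) returns character strings, you cannot use subx$group[j]: indexing a plain (unnamed) vector with a string such as "1" looks for an element with that name and returns NA. Two options:
   - Preferred. Use for (j in seq_len(nrow(subx))), and all of the references work without change.
   - Keep for(j in rownames(subx)), but change all subx$ references to be akin to subx[j,"group"].
2. x[x$observer == i]$valid is wrong code: a single bracket with one index selects columns, not rows. Change it to x$valid[x$observer == i].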
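To make the second fix concrete, here is a minimal sketch on a toy data frame. The data and the observer codes are made up for illustration and are not the asker's data. It shows two equivalent ways to set valid to 0 for one observer's rows.

```r
# Toy data, invented for illustration only
x <- data.frame(
  observer = c("MA", "LE", "MA"),
  valid = c(1, 1, 1),
  stringsAsFactors = FALSE
)
i <- "LE"

# Index the column vector with a logical row mask, then assign
x$valid[x$observer == i] <- 0

# Equivalent form: row mask first, column name second
x2 <- data.frame(
  observer = c("MA", "LE", "MA"),
  valid = c(1, 1, 1),
  stringsAsFactors = FALSE
)
x2[x2$observer == i, "valid"] <- 0

print(x$valid)
print(x2$valid)
```

Both forms only touch the rows where the mask is TRUE and leave the rest of the column unchanged.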
After those two changes, your code runs without error, and in this example prints "removed pre-reliability" four times on the console.
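When troubleshooting, note that subx$group[1] and subx$group["1"] are not interchangeable: the first indexes by position, while the second looks up an element named "1" and, because the vector has no names, returns NA (as expected). That is why the comparison can return TRUE or FALSE when run by hand with a number but fail inside the loop, where j is a string.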
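The difference between positional and character indexing can be reproduced in a few lines. This is a small sketch on an invented two-row data frame (the name subx is reused only to mirror the question), and tryCatch is used here just to capture the error message instead of stopping.

```r
# Invented two-row stand-in for subx
subx <- data.frame(group = c("EKK", "KKL"), stringsAsFactors = FALSE)

j <- rownames(subx)[1]          # "1", a character string, as in the loop

by_position <- subx$group[1]    # positional index into the column vector
by_name     <- subx$group[j]    # name lookup on an unnamed vector gives NA
by_row_name <- subx[j, "group"] # data frame rows can be looked up by row name

# An NA condition inside if() raises the error from the question
msg <- tryCatch(
  if (by_name == "EKK") "match" else "no match",
  error = function(e) conditionMessage(e)
)

print(by_position)
print(by_name)
print(by_row_name)
print(msg)
```

Looping over seq_len(nrow(subx)) avoids the problem entirely, because j is then an integer position.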
I am using the following code to determine if any of the columns in my data table have 1065. If any of the columns do have 1065, I get "TRUE" which works perfectly. Now I want to only output true if any of the columns notcancer0:notcancer33 contains 1065 AND all the rest are NA. Other columns may contain other values like 1064, 1066, etc. But I want to output "TRUE" for the rows where there is only 1065 and all the rest of the columns contain NAs for that row. What is the best way to do this?
biobank_nsaid[, ischemia1 := Reduce(`|`, lapply(.SD, `==`, "1065")), .SDcols=notcancer0:notcancer33]
Sample data:
biobank_nsaid = structure(list(aspirin = structure(c(2L, 1L, 1L, 1L), .Label =
c("FALSE", "TRUE"), class = "factor"), aspirinonly = c(TRUE, FALSE, FALSE,
FALSE), med0 = c(1140922174L, 1140871050L, 1140879616L, 1140909674L ), med1 =
c(1140868226L, 1140876592L, 1140869180L, NA), med2 = c(1140879464L, NA,
1140865016L, NA), med3 = c(1140879428L, NA, NA, NA)), row.names = c(NA, -4L),
class = c("data.table", "data.frame"))
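Here are 2 options, using the sample data's med0:med3 columns and the value 1140909674 as stand-ins for notcancer0:notcancer33 and 1065. A row qualifies when all but one of the columns are NA and exactly one column equals the target value: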
setDT(biobank_nsaid)[, ischemia1 :=
rowSums(is.na(.SD))==ncol(.SD)-1L & rowSums(.SD==1140909674, na.rm=TRUE)==1L,
.SDcols=med0:med3]
Or, equivalently, after negating both conditions (De Morgan's law):
biobank_nsaid[, ic2 :=
!(rowSums(is.na(.SD))!=ncol(.SD)-1L | rowSums(.SD==1140909674, na.rm=TRUE)!=1L),
.SDcols=med0:med3]
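The same row-wise logic can be checked in base R without data.table. The sketch below copies the med0 to med3 values from the sample data into a plain data frame (the names meds, target, n_na and n_hit are chosen here for illustration) and confirms that both formulations agree.

```r
# med0..med3 from the sample data, as a plain data frame
meds <- data.frame(
  med0 = c(1140922174, 1140871050, 1140879616, 1140909674),
  med1 = c(1140868226, 1140876592, 1140869180, NA),
  med2 = c(1140879464, NA, 1140865016, NA),
  med3 = c(1140879428, NA, NA, NA)
)
target <- 1140909674

n_na  <- rowSums(is.na(meds))                   # NAs per row
n_hit <- rowSums(meds == target, na.rm = TRUE)  # cells equal to target per row

# Option 1: all but one column NA, and exactly one column equals the target
only_target <- unname(n_na == ncol(meds) - 1L & n_hit == 1L)

# Option 2: the negated form of the same condition
only_target2 <- unname(!(n_na != ncol(meds) - 1L | n_hit != 1L))

print(only_target)
print(only_target2)
```

Only the last row qualifies, since it holds the target in med0 and NA everywhere else.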
What I want to do is generate a new column in a dataframe that meets these conditions:
dataframe1$var1 == dataframe2$var1 &
dataframe1$var2 == dataframe2$var2 &
dataframe1$var3 == dataframe2$var3
Basically I need to generate a dummy variable that has the value 1 if the conditions are met, and the value 0 if they are not.
I've tried the following code that doesn't work:
dataframe1$NewVar <- ifelse(dataframe1$var1 == dataframe2$var1 &
dataframe1$var2 == dataframe2$var2 & dataframe1$var3 == dataframe2$var3 , 1, 0)
Data
dput(df1)
structure(list(var1 = c("A", "B", "C"), var2 = c("X", "X", "X"
), var3 = c(1, 2, 2)), .Names = c("var1", "var2", "var3"), row.names = c(NA,
-3L), class = "data.frame")
dput(df2)
structure(list(var1 = c("A", "A", "C"), var2 = c("X", "X", "Y"
), var3 = c(1, 1, 1)), .Names = c("var1", "var2", "var3"), row.names = c(NA,
-3L), class = "data.frame")
btw my dataset is not as simple as the example I posted in the pictures.
I don't know if it's relevant but values in my variables (columns) would look like this:
var1: 24000000000
var2: 1234567
var3: 8
You can simply do,
as.integer(rowSums(df1 == df2) == ncol(df1))
#[1] 1 0 0
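For reference, here is the same idea spelled out with the df1 and df2 from the question, restricted to a named set of columns in case the real data frames carry extra columns. The vector key_cols is a name introduced here for illustration. Note that the comparison is positional: row i of df1 is compared with row i of df2, so both data frames need the same number of rows in the same order.

```r
df1 <- data.frame(
  var1 = c("A", "B", "C"),
  var2 = c("X", "X", "X"),
  var3 = c(1, 2, 2),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  var1 = c("A", "A", "C"),
  var2 = c("X", "X", "Y"),
  var3 = c(1, 1, 1),
  stringsAsFactors = FALSE
)

# Columns that must all match (hypothetical name, adjust to the real data)
key_cols <- c("var1", "var2", "var3")

# Element-wise comparison gives a logical matrix; a row is a full match
# when the number of TRUE cells equals the number of compared columns
matches <- df1[key_cols] == df2[key_cols]
df1$NewVar <- as.integer(rowSums(matches) == length(key_cols))

print(df1)
```

If any compared cell is NA, the row sum is NA as well, so the dummy would be NA for that row unless the NAs are handled first.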
Here is some sample data for which I want to encode the gender of the names over time:
names_to_encode <- structure(list(names = structure(c(2L, 2L, 1L, 1L, 3L, 3L), .Label = c("jane", "john", "madison"), class = "factor"), year = c(1890, 1990, 1890, 1990, 1890, 2012)), .Names = c("names", "year"), row.names = c(NA, -6L), class = "data.frame")
Here is a minimal set of the Social Security data, limited to just those names from 1890 and 1990:
ssa_demo <- structure(list(name = c("jane", "jane", "john", "john", "madison", "madison"), year = c(1890L, 1990L, 1890L, 1990L, 1890L, 1990L), female = c(372, 771, 56, 81, 0, 1407), male = c(0, 8, 8502, 29066, 14, 145)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L), .Names = c("name", "year", "female", "male"))
I've defined a function which subsets the Social Security data given a year or range of years. In other words, it calculates whether a name was male or female over a given time period by figuring out the proportion of male and female births with that name. Here is the function along with a helper function:
require(plyr)
require(dplyr)
select_ssa <- function(years) {
  # If we get only one year (1890) convert it to a range of years (1890-1890)
  if (length(years) == 1) years <- c(years, years)
  # Calculate the male and female proportions for the given range of years
  ssa_select <- ssa_demo %.%
    filter(year >= years[1], year <= years[2]) %.%
    group_by(name) %.%
    summarise(female = sum(female),
              male = sum(male)) %.%
    mutate(proportion_male = round((male / (male + female)), digits = 4),
           proportion_female = round((female / (male + female)), digits = 4)) %.%
    mutate(gender = sapply(proportion_female, male_or_female))
  return(ssa_select)
}
# Helper function to determine whether a name is male or female in a given year
male_or_female <- function(proportion_female) {
  if (proportion_female > 0.5) {
    return("female")
  } else if(proportion_female == 0.5000) {
    return("either")
  } else {
    return("male")
  }
}
Now what I want to do is use plyr, specifically ddply, to subset the data to be encoded by year, and merge each of those pieces with the value returned by the select_ssa function. This is the code I have.
ddply(names_to_encode, .(year), merge, y = select_ssa(year), by.x = "names", by.y = "name", all.x = TRUE)
When calling select_ssa(year), this command works just fine if I hard code a value like 1890 as the argument to the function. But when I try to pass it the current value for year that ddply is working with, I get an error message:
Error in filter_impl(.data, dots(...), environment()) :
(list) object cannot be coerced to type 'integer'
How can I pass the current value of year on to ddply?
I think you're making things too complicated by trying to do a join inside ddply. If I were to use dplyr I would probably do something more like this:
names_to_encode <- structure(list(name = structure(c(2L, 2L, 1L, 1L, 3L, 3L), .Label = c("jane", "john", "madison"), class = "factor"), year = c(1890, 1990, 1890, 1990, 1890, 2012)), .Names = c("name", "year"), row.names = c(NA, -6L), class = "data.frame")
ssa_demo <- structure(list(name = c("jane", "jane", "john", "john", "madison", "madison"), year = c(1890L, 1990L, 1890L, 1990L, 1890L, 1990L), female = c(372, 771, 56, 81, 0, 1407), male = c(0, 8, 8502, 29066, 14, 145)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L), .Names = c("name", "year", "female", "male"))
names_to_encode$name <- as.character(names_to_encode$name)
names_to_encode$year <- as.integer(names_to_encode$year)
tmp <- left_join(ssa_demo,names_to_encode) %.%
  group_by(year,name) %.%
  summarise(female = sum(female),
            male = sum(male)) %.%
  mutate(proportion_male = round((male / (male + female)), digits = 4),
         proportion_female = round((female / (male + female)), digits = 4)) %.%
  mutate(gender = ifelse(proportion_female == 0.5,"either",
                  ifelse(proportion_female > 0.5,"female","male")))
Note that dplyr 0.1.1 is still a little finicky about the types of join columns, so I had to convert them (name to character, year to integer). I think I saw some activity on GitHub suggesting that this was either fixed in the dev version or at least being worked on.
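As a rough base R sketch of the same join-first idea, for readers without dplyr at hand: merge the names to encode with the counts on name and year, then classify with a vectorized ifelse. This is not the code above translated one to one. It assumes an exact-year match rather than a range of years, it uses merge() with all.x = TRUE so that every row of names_to_encode is kept, and the data frames are rebuilt here with character names and integer years.

```r
names_to_encode <- data.frame(
  name = c("john", "john", "jane", "jane", "madison", "madison"),
  year = c(1890L, 1990L, 1890L, 1990L, 1890L, 2012L),
  stringsAsFactors = FALSE
)
ssa_demo <- data.frame(
  name = c("jane", "jane", "john", "john", "madison", "madison"),
  year = c(1890L, 1990L, 1890L, 1990L, 1890L, 1990L),
  female = c(372, 771, 56, 81, 0, 1407),
  male = c(0, 8, 8502, 29066, 14, 145),
  stringsAsFactors = FALSE
)

# Left join: keep every name/year to encode, attach counts where available
out <- merge(names_to_encode, ssa_demo, by = c("name", "year"), all.x = TRUE)

# Proportions and a vectorized classification (no sapply over a scalar helper)
out$proportion_female <- round(out$female / (out$male + out$female), digits = 4)
out$gender <- ifelse(out$proportion_female == 0.5, "either",
              ifelse(out$proportion_female > 0.5, "female", "male"))

print(out)
```

Rows with no matching year in the counts, such as madison in 2012, come back with NA for the counts and therefore NA for gender, which makes the gaps in the lookup data visible.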