Custom data-dependent recoding to logicals in R

I have two data frames, data and meta. Some, but not all, columns in data are logical values, but they are coded in many different ways. The rows in meta describe the columns in data, indicate whether they are to be interpreted as logicals, and if so, what single value codes TRUE and what single value codes FALSE.
I need a procedure that replaces all data values in conceptually logical columns with the appropriate logical values from the codes in the corresponding meta row. Any data values in a conceptually logical column that do not match a value in the corresponding meta row should become NA.
Small toy example for meta:
name                 type     false  true
------------------------------------------
a.char.var           char     NA     NA
a.logical.var        logical  NA     7
another.logical.var  logical  1      0
another.char.var     char     NA     NA
Small toy example for data:
a.char.var  a.logical.var  another.logical.var  another.char.var
-----------------------------------------------------------------
aa          7              0                    ba
ab          NA             1                    bb
ac          7              NA                   bc
ad          4              3                    bd
Small toy example output:
a.char.var  a.logical.var  another.logical.var  another.char.var
-----------------------------------------------------------------
aa          TRUE           TRUE                 ba
ab          FALSE          FALSE                bb
ac          TRUE           NA                   bc
ad          NA             NA                   bd
I cannot, for the life of me, find a way to do this in idiomatic R that handles all the corner cases. The data sets are large, so an idiomatic solution would be ideal if possible. I inherited this absolutely insane data management mess and will be grateful to anybody who can help fix it. I am by no means an R guru, but this seems like a deceptively difficult problem.

First we set up the data:
meta <- data.frame(name = c('a.char.var', 'a.logical.var', 'another.logical.var', 'another.char.var'),
                   type = c('char', 'logical', 'logical', 'char'),
                   false = c(NA, NA, 1, NA),
                   true = c(NA, 7, 0, NA), stringsAsFactors = FALSE)
data <- data.frame(a.char.var = c('aa', 'ab', 'ac', 'ad'),
                   a.logical.var = c(7, NA, 7, 4),
                   another.logical.var = c(0, 1, NA, 3),
                   another.char.var = c('ba', 'bb', 'bc', 'bd'), stringsAsFactors = FALSE)
Then we subset out just the logical columns. We will iterate through these, using the name column to select the relevant column in data, and change values in data_out from an initialized NA to either T or F according to matching values in data.
Note that data[,logical_meta$name[1]] is equivalent to data[,'a.logical.var'] or data$a.logical.var, if logical_meta$name is a character. If it's a factor (eg if we didn't specify stringsAsFactors=F) we need to convert to character at which point we might as well give it a name - colname below.
Having NAs to contend with means using which is advantageous: c(0, 1, NA, 3) == 0 returns T, F, NA, F, but which then ignores the NA and returns just the position 1. Subsetting by a logical vector that includes NAs yields NA rows or columns; using which avoids this.
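A minimal sketch of the NA behaviour described above. The vector x and the one-column data frame d are throwaway names for illustration, not part of the original data:

```r
x <- c(0, 1, NA, 3)

x == 0          # TRUE FALSE NA FALSE: the comparison propagates the NA
which(x == 0)   # 1: which() keeps only the positions that are TRUE

d <- data.frame(v = x)
with_na    <- d[x == 0, , drop = FALSE]         # the match plus an all-NA row
without_na <- d[which(x == 0), , drop = FALSE]  # just the match

nrow(with_na)     # 2
nrow(without_na)  # 1
```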
logical_meta <- meta[meta$type == 'logical', ]
data_out <- data  # initialize
# seq_len() gives an empty loop when there are no logical columns (1:0 would not)
for (i in seq_len(nrow(logical_meta))) {
  colname <- as.character(logical_meta$name[i])  # only need as.character if factor
  data_out[, colname] <- NA
  # false column first
  if (is.na(logical_meta$false[i])) {
    data_out[is.na(data[, colname]), colname] <- FALSE
  } else {
    data_out[which(data[, colname] == logical_meta$false[i]), colname] <- FALSE
  }
  # true column next
  if (is.na(logical_meta$true[i])) {
    data_out[is.na(data[, colname]), colname] <- TRUE
  } else {
    data_out[which(data[, colname] == logical_meta$true[i]), colname] <- TRUE
  }
}
data_out
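The same logic can also be factored into a helper that recodes one column at a time. This is only a sketch: recode_logical and recode_all are hypothetical names, not part of the answer above, and it assumes that an NA code in meta means "a missing value in data stands for that logical value", as in the toy example.

```r
meta <- data.frame(
  name  = c("a.char.var", "a.logical.var", "another.logical.var", "another.char.var"),
  type  = c("char", "logical", "logical", "char"),
  false = c(NA, NA, 1, NA),
  true  = c(NA, 7, 0, NA),
  stringsAsFactors = FALSE
)
data <- data.frame(
  a.char.var          = c("aa", "ab", "ac", "ad"),
  a.logical.var       = c(7, NA, 7, 4),
  another.logical.var = c(0, 1, NA, 3),
  another.char.var    = c("ba", "bb", "bc", "bd"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: recode one vector given its TRUE and FALSE codes.
recode_logical <- function(x, true_code, false_code) {
  # Positions of x that match a code; an NA code matches missing values.
  matches <- function(code) {
    if (is.na(code)) is.na(x) else !is.na(x) & x == code
  }
  out <- rep(NA, length(x))          # anything unmatched stays NA
  out[matches(false_code)] <- FALSE
  out[matches(true_code)]  <- TRUE
  out
}

# Hypothetical wrapper: apply the helper to every column meta marks as logical.
recode_all <- function(data, meta) {
  for (i in which(meta$type == "logical")) {
    col <- as.character(meta$name[i])
    data[[col]] <- recode_logical(data[[col]], meta$true[i], meta$false[i])
  }
  data
}

data_out <- recode_all(data, meta)
data_out
```

Keeping the per-column rule in one function makes the corner cases (NA as a code, unmatched values) easier to check in isolation.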

I've written a function that takes in the column index of data and tries to perform the operation you described.
The function first selects x as the column we are interested in. We then match the name of the column in data to the entries in the first column of meta; this gives our row of interest.
We then check whether the column type is logical. If it isn't, we just return x, since nothing needs to change. If the column type is logical, we then check whether its values match the true or false columns in meta.
# meta has no default here: a default of `meta = meta` refers to itself and
# errors whenever the argument is not supplied
convert_data <- function(colindex, dat, meta) {
  x <- dat[, colindex]  # select our data vector
  # match the column name to the first column in meta
  find_in_meta <- match(names(dat)[colindex], meta[, 1])
  # what type of column is it
  type_col <- meta[find_in_meta, 2]
  if (type_col != 'logical') {
    return(x)
  } else {
    # replace an NA code for TRUE with a placeholder string
    true_val <- ifelse(is.na(meta[find_in_meta, 4]), 'NA_val',
                       meta[find_in_meta, 4])
    # replace an NA code for FALSE with a placeholder string
    false_val <- ifelse(is.na(meta[find_in_meta, 3]), 'NA_val',
                        meta[find_in_meta, 3])
    # replace NA data values with the same placeholder so they can be compared
    x <- ifelse(is.na(x), 'NA_val', x)
    x <- ifelse(x == true_val, TRUE,
                ifelse(x == false_val, FALSE, NA))
    return(x)
  }
}
We can then use lapply and a little data manipulation to get it into an acceptable form:
res <- lapply(seq_along(df1), function(ind)
  convert_data(colindex = ind, dat = df1, meta = meta))
setNames(do.call('cbind.data.frame', res), names(df1))
  a.char.var a.logical.var another.logical.var another.char.var
1         aa          TRUE                TRUE               ba
2         ab         FALSE               FALSE               bb
3         ac          TRUE                  NA               bc
4         ad            NA                  NA               bd
Data (df1 plays the role of data from the question)
meta <- structure(list(name = c("a.char.var", "a.logical.var", "another.logical.var",
"another.char.var"), type = c("char", "logical", "logical", "char"
), false = c(NA, NA, 1L, NA), true = c(NA, 7L, 0L, NA)), .Names = c("name",
"type", "false", "true"), class = "data.frame", row.names = c(NA,
-4L))
df1 <- structure(list(a.char.var = c("aa", "ab", "ac", "ad"), a.logical.var = c(7L,
NA, 7L, 4L), another.logical.var = c(0L, 1L, NA, 3L), another.char.var = c("ba",
"bb", "bc", "bd")), .Names = c("a.char.var", "a.logical.var",
"another.logical.var", "another.char.var"), class = "data.frame", row.names = c(NA,
-4L))

Related

Mutate a column and name it after the input variable for a function in R

I have a data frame in R that is 89 columns wide and 500,000 rows long. Each column holds 4-digit numeric codes, and any given code can appear in any column. I want a function that scans each row for a code and flags the row with 1 if the code is present and 0 if not. The new column must be named after the code searched for, or something very similar (an appended letter, etc.). This then has to be repeated for ~450 such codes. Each new column would be labelled after its code, like the rightmost column below.
  c1    c2    c3    3369
1 2255  3669  NA    1
2 NA    5555  6598  0
3 NA    NA    1245  0
I have attempted to do this using mutate and rowSums (see below). This works for an individual code, but I cannot get it to work with sapply: it just creates a single column called "x".
a <- function(x) {
  SR2 <<- SR2 %>% mutate(x = ifelse(rowSums(SR2 == x, na.rm = TRUE) > 0, 1, 0))
}
The x in this function is a list of codes, so "3369", "2255" etc.
What am I missing here?
Use quo_name with !! to get the correct column name, and map_dfc to collect the output as a data frame:
library(purrr)
library(dplyr)
df_out <- map_dfc(c('2255', '5555'),
  ~ transmute(df, !!quo_name(.x) := ifelse(rowSums(df == .x, na.rm = TRUE) > 0, 1, 0)))
bind_cols(df, df_out)
Data
df <- structure(list(c1 = c(2255L, NA, NA), c2 = c(3669L, 5555L, NA), c3 = c(NA, 6598L, 1245L),
`3369` = c(1L, 0L, 0L)), class = "data.frame", row.names = c("1", "2", "3"))
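For comparison, a base R sketch of the same idea that builds all flag columns in one pass. The names codes, m, flags and out, and the "code_" prefix for the new columns, are assumptions made for illustration; the example data contains only the three code columns from the question.

```r
df <- data.frame(
  c1 = c(2255L, NA, NA),
  c2 = c(3669L, 5555L, NA),
  c3 = c(NA, 6598L, 1245L)
)
codes <- c(2255L, 5555L, 3669L)

# Compare against a matrix once, instead of re-scanning the data frame per code.
m <- as.matrix(df)

# One integer column per code: 1 if the code occurs anywhere in the row, else 0.
flags <- vapply(
  codes,
  function(code) as.integer(rowSums(m == code, na.rm = TRUE) > 0),
  integer(nrow(df))
)
colnames(flags) <- paste0("code_", codes)  # prefix keeps the names syntactic

out <- cbind(df, flags)
out
```

Prefixing the code avoids column names that start with a digit, which otherwise need backticks whenever they are referenced.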

Remove rows if a value matches one that was conditionally removed in R

I have a data frame. I'm trying to remove rows whose value in one column matches that of other rows that were conditionally removed. A simple example explains it better.
I tried using the previous post as a starting point:
Remove Rows From Data Frame where a Row match a String
>dat
A,B,C
4,3,Foo
2,3,Bar
1,2,Bar
7,5,Zap
First remove rows with "Foo" in column C:
dat[!grepl("Foo", dat$C),]
Now I want to remove any additional rows that have values in column B that match the values in rows with Foo. So in this example, any rows with B = 3 would be removed because row 1 has Foo, which was removed and has B=3.
>dat.new
1,2,Bar
7,5,Zap
Any ideas on how to do this would be appreciated.
We take the 'B' values where 'C' is 'Foo', build a logical vector by checking which 'B' values are among them, negate it (!), and add the condition that 'C' is not "Foo":
library(dplyr)
dat.new <- dat %>%
  filter(!B %in% B[C == 'Foo'], C != 'Foo')
dat.new
# A B C
#1 1 2 Bar
#2 7 5 Zap
Or in base R with subset
subset(dat, !B %in% B[C == 'Foo'] & C != "Foo")
Data
dat <- structure(list(A = c(4L, 2L, 1L, 7L), B = c(3L, 3L, 2L, 5L),
C = c("Foo", "Bar", "Bar", "Zap")), row.names = c(NA, -4L
), class = "data.frame")
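The same filter can be spelled out step by step in base R, which makes the intermediate values easy to inspect. This is a sketch; foo_b and keep are hypothetical names used only here.

```r
dat <- data.frame(
  A = c(4L, 2L, 1L, 7L),
  B = c(3L, 3L, 2L, 5L),
  C = c("Foo", "Bar", "Bar", "Zap"),
  stringsAsFactors = FALSE
)

# Step 1: the B values that occur on the "Foo" rows.
foo_b <- dat$B[dat$C == "Foo"]

# Step 2: keep rows that are not "Foo" and whose B is not one of those values.
keep <- dat$C != "Foo" & !(dat$B %in% foo_b)

dat.new <- dat[keep, ]
dat.new
```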

R get row number of the first row that has a string variable in a data frame column

I am working with data frames that are dynamically generated.
df1 <- structure(list(`4` = c(NA, NA, "Location", NA), `5` = c(NA, NA,
"Size", "W")), row.names = c(NA, 4L), class = "data.frame")
The above looks like this:
4 5
1 <NA> <NA>
2 <NA> <NA>
3 Location Size
4 <NA> W
From each column in the data frame I want to get the first character variable. For example from the above table, I want to retrieve Location and Size and use them as my column header.
Since the tables are dynamically generated, I am not sure in which line the string variable would appear.
An option is to loop through the columns and get the first non-NA element with summarise_all (funs() is deprecated as of dplyr 0.8.0, so a formula lambda is used here):
library(dplyr)
df1 %>%
  summarise_all(~ .x[!is.na(.x)][1])
Or with sapply, use the same logic
sapply(df1, function(x) x[!is.na(x)][1])
Or with which on logical matrix (!is.na(df1)), subset the data, get the first element of each column by filtering out the duplicate column index
ind <- which(!is.na(df1), arr.ind = TRUE)
df1[ind][!duplicated(ind[,2])]
#[1] "Location" "Size"
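To finish the task from the question, the extracted values can be assigned as the column headers. A minimal sketch, assuming every column has at least one non-NA value; first_vals is a hypothetical name:

```r
df1 <- data.frame(
  `4` = c(NA, NA, "Location", NA),
  `5` = c(NA, NA, "Size", "W"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# First non-NA element of each column, as a character vector.
first_vals <- vapply(df1, function(x) x[!is.na(x)][1], character(1))

# Use those values as the new column names.
names(df1) <- unname(first_vals)
names(df1)
```

The header row itself is still present in the data after this step; whether to drop it depends on how the tables are used afterwards.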

R update if statement to add count

I'm trying to count how many columns contain text per row. I have the following that tells me if all columns contain text:
df = structure(list(Participant = 1:3, A = c("char", "foo", ""), B = c("char2", 0L, 0L)), .Names = c("Participant", "A", "B"), row.names = c(NA, -3L), class = "data.frame")
df$newcolumn <- ifelse(nchar(df$A)>1 & nchar(df$B)>1, "yes", "no")
Instead of "Yes" or "No" I want a count of how many matches occur. Ideas?
Using your logic you can try something like the following:
df$newcolumn <- (nchar(df$A)>1) + (nchar(df$B)>1)
df
  Participant    A     B newcolumn
1           1 char char2         2
2           2  foo     0         1
3           3          0         0
If we need to get the nchar per row, loop through the columns of interest, get the nchar, and use Reduce with + to get the sum per each row
df$CountNChar <- Reduce(`+`, lapply(df[-1], nchar))
Or if we need the sum of logical condition, just change the nchar to nchar(x) > 1 (with anonymous function call)
df$CountNChar <- Reduce(`+`, lapply(df[-1], function(x) nchar(x) >1))
df$CountNChar
#[1] 2 1 0
You appear to be trying to count the number of rows where df$A and df$B both have more than one character in them. The easiest way to do this is with sum, since logical vectors can be added up just like numeric or integer. Thus, the code fragment you want is
sum(nchar(df$A)>1 & nchar(df$B)>1)
However, looking at your first sentence, you should be aware that only one type of data can exist in a column of a data frame. c("foo",0L,0L) is a vector of class "character", with elements "foo","0","0".
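Both points can be checked in a short sketch: the coercion of mixed values to character, and a per-row count that extends to any number of columns. The names mixed and has_text are hypothetical, and B is written out as strings because that is what the column holds after coercion.

```r
# Mixing strings and integers in c() coerces everything to character.
mixed <- c("foo", 0L, 0L)
class(mixed)   # "character"

df <- data.frame(
  Participant = 1:3,
  A = c("char", "foo", ""),
  B = c("char2", "0", "0"),
  stringsAsFactors = FALSE
)

# Logical matrix: TRUE where a cell has more than one character.
has_text <- sapply(df[c("A", "B")], function(x) nchar(x) > 1)

# Row-wise count of TRUE values.
df$newcolumn <- rowSums(has_text)
df$newcolumn   # 2 1 0
```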

constraining on classes of column in data frame in R

So, I am trying to write a function that takes a data frame as input and checks whether its columns contain only integer, character (not factor), and numeric vectors. If so, I want it to return TRUE; if it contains anything else, I want it to return FALSE.
For example:
df1 <- data.frame( a = 1:4, b = c("x","y", "z","w"), c = 8:11, stringsAsFactors = FALSE)
df2 <- data.frame(a = 2:5, b = c("m", "n", "o", "p"),c = 11:14, stringsAsFactors = TRUE)
In this case, the function should return TRUE with input df1 since it has integer and character type columns. But for df2, I want to return FALSE since it contains factor column b.
Could someone help?
Since integers are also numeric, you can use the condition
is.numeric(x) | is.character(x)
Here's a function:
numOrChar <- function(df) {
  f <- function(x) is.numeric(x) | is.character(x)
  all(vapply(df, f, logical(1L)))
}
numOrChar(df1)
# [1] TRUE
numOrChar(df2)
# [1] FALSE
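When the check returns FALSE, it can help to see which columns caused it. A small sketch along the same lines; col_classes and bad_cols are hypothetical names:

```r
df2 <- data.frame(a = 2:5, b = c("m", "n", "o", "p"), c = 11:14,
                  stringsAsFactors = TRUE)

# First class of every column, for a quick overview.
col_classes <- vapply(df2, function(x) class(x)[1], character(1))
col_classes

# Names of the columns that are neither numeric nor character.
ok <- vapply(df2, function(x) is.numeric(x) | is.character(x), logical(1))
bad_cols <- names(df2)[!ok]
bad_cols   # "b"
```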
