I have a question similar to this one, but my dataset is a bit bigger: 50 columns, with one column as a UID and the other columns containing either TRUE or NA. I want to change all the NAs to FALSE, but I don't want to use an explicit loop.
Can plyr do the trick? Thanks.
UPDATE #1
Thanks for quick reply, but what if my dataset is like below:
df <- data.frame(
id = c(rep(1:19),NA),
x1 = sample(c(NA,TRUE), 20, replace = TRUE),
x2 = sample(c(NA,TRUE), 20, replace = TRUE)
)
I only want x1 and x2 to be processed. How can this be done?
If you want to do the replacement for a subset of variables, you can still use the is.na(*) <- trick, as follows:
df[c("x1", "x2")][is.na(df[c("x1", "x2")])] <- FALSE
IMO using temporary variables makes the logic easier to follow:
vars.to.replace <- c("x1", "x2")
df2 <- df[vars.to.replace]
df2[is.na(df2)] <- FALSE
df[vars.to.replace] <- df2
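With around 50 columns, as in the original question, typing every name gets tedious. A minimal sketch, assuming the identifier column is called id and every other column is a TRUE/NA flag (the name flag_cols and the seed are illustrative, not from the original post):

```r
set.seed(42)  # assumed seed, only so the example is reproducible
df <- data.frame(
  id = c(1:19, NA),
  x1 = sample(c(NA, TRUE), 20, replace = TRUE),
  x2 = sample(c(NA, TRUE), 20, replace = TRUE)
)

# Every column except the identifier is treated as a flag column
flag_cols <- setdiff(names(df), "id")

# Same indexing trick as above, applied to the computed column set
df[flag_cols][is.na(df[flag_cols])] <- FALSE
```

The NA in id is left alone because id is not part of flag_cols.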
tidyr::replace_na is an excellent function for this.
df %>%
replace_na(list(x1 = FALSE, x2 = FALSE))
This is such a great quick fix. The only trick is that you pass a list naming the columns you want to change, each with its replacement value.
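When there are many columns, the list passed to replace_na() can be built programmatically rather than typed out. A sketch under the assumption that every column other than id should be filled with FALSE (cols and fill are illustrative names):

```r
library(tidyr)

df <- data.frame(
  id = c(1:4, NA),
  x1 = c(TRUE, NA, TRUE, NA, NA),
  x2 = c(NA, NA, TRUE, TRUE, NA)
)

# Columns to fill: everything except the identifier
cols <- setdiff(names(df), "id")

# Named list of replacement values, one FALSE per column
fill <- setNames(rep(list(FALSE), length(cols)), cols)

out <- replace_na(df, fill)
```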
Try this code:
df <- data.frame(
id = c(rep(1:19), NA),
x1 = sample(c(NA, TRUE), 20, replace = TRUE),
x2 = sample(c(NA, TRUE), 20, replace = TRUE)
)
replace(df, is.na(df), FALSE)
UPDATE: another solution.
df2 <- df <- data.frame(
id = c(rep(1:19), NA),
x1 = sample(c(NA, TRUE), 20, replace = TRUE),
x2 = sample(c(NA, TRUE), 20, replace = TRUE)
)
df2[names(df) == "id"] <- FALSE
df2[names(df) != "id"] <- TRUE
replace(df, is.na(df) & df2, FALSE)
You can use the NAToUnknown function in the gdata package
df[,c('x1', 'x2')] = gdata::NAToUnknown(df[,c('x1', 'x2')], unknown = 'FALSE')
With dplyr you could also do
df %>% mutate_each(funs(replace(., is.na(.), FALSE)), x1, x2)
It is a bit less readable than just using replace(), but more generic because it lets you select the columns to be transformed. This solution is especially useful if you want to keep NAs in some columns but get rid of them in others.
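mutate_each() and funs() have since been deprecated in dplyr. A sketch of the same idea with across(), assuming dplyr 1.0.0 or later is installed (the small data frame is illustrative):

```r
library(dplyr)

df <- data.frame(
  id = c(1:4, NA),
  x1 = c(TRUE, NA, TRUE, NA, NA),
  x2 = c(NA, NA, TRUE, TRUE, NA)
)

# Replace NA with FALSE only in the selected columns; id keeps its NA
out <- df %>%
  mutate(across(c(x1, x2), ~ replace(.x, is.na(.x), FALSE)))
```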
I need to label values in a lot of variables with sjlabelled::set_labels. Here is a reproducible example and what already works:
library(data.table)
library(sjlabelled)
lookup <- data.table(id = paste0("q", 1:5),
answers = paste(paste0("atext", 1:5), paste0("btext", 1:5)
, paste0("ctext", 1:5), sep = ";"))
data <- data.table(q1 = sample(1:3, 10, replace = TRUE),
q2 = sample(1:3, 10, replace = TRUE),
q3 = sample(1:3, 10, replace = TRUE),
q4 = sample(1:3, 10, replace = TRUE),
q5 = sample(1:3, 10, replace = TRUE))
data$q1 <- set_labels(data$q1, labels = unlist(strsplit(lookup[id == "q1", answers], split = ";")))
get_labels(data$q1)
So the labels for the different answers (= values) are separated by a semicolon. As you can see in the example code, I can make it work if I call each variable by its id, but I am struggling when I want to "loop" through all variables.
The goal is to be able to export the datatable (or dataframe) as an SPSS file. If it works with other packages I would also be happy.
Match the column names of data with id, split the answers on ; and pass the labels as a list.
library(sjlabelled)
data <- set_labels(data, labels = strsplit(lookup$answers[match(names(data), lookup$id)], ';'))
get_labels(data)
#$q1
#[1] "atext1" "btext1" "ctext1"
#$q2
#[1] "atext2" "btext2" "ctext2"
#$q3
#[1] "atext3" "btext3" "ctext3"
#$q4
#[1] "atext4" "btext4" "ctext4"
#$q5
#[1] "atext5" "btext5" "ctext5"
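The key step in this answer is the match() lookup, which aligns the label strings with the column order of the data rather than the row order of the lookup table. A base R sketch of just that step, without sjlabelled (col_names stands in for names(data) and is deliberately out of order):

```r
lookup <- data.frame(
  id = paste0("q", 1:5),
  answers = paste(paste0("atext", 1:5), paste0("btext", 1:5),
                  paste0("ctext", 1:5), sep = ";"),
  stringsAsFactors = FALSE
)

# Hypothetical column order that differs from the lookup order
col_names <- c("q3", "q1")

# match() returns the lookup row for each column name, in column order
rows <- match(col_names, lookup$id)

# Split each answer string on ";" to get one label vector per column
labels <- strsplit(lookup$answers[rows], ";", fixed = TRUE)
names(labels) <- col_names
```

A column name that is missing from lookup$id would yield NA here, so it may be worth checking anyNA(rows) before setting labels.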
I have recently come across the package called skimr, which helps create useful summary statistics. I have written the following code to extract summary stats only on numerical columns. My first question is: is there a more direct way in skimr to specify the type of variables for which I want summary stats? My second question is: what does append = TRUE actually achieve when I write the my_skim "closure"?
library(skimr)
library(dplyr)
### Creating an example dataset
test.df1 <- data.frame("Year" = sample(2018:2020, 20, replace = TRUE),
"Firm" = head(LETTERS, 5),
"Exporter"= sample(c("Yes", "No"), 20, replace = TRUE),
"Revenue" = sample(100:200, 20, replace = TRUE),
stringsAsFactors = FALSE)
test.df1 <- rbind(test.df1,
data.frame("Year" = c(2018, 2018),
"Firm" = c("Y", "Z"),
"Exporter" = c("Yes", "No"),
"Revenue" = c(NA, NA)))
test.df1 <- test.df1 %>% mutate(Profit = Revenue - sample(20:30, 22, replace = TRUE ))
### Using skimr package to extract summary stats
my_skim <- skim_with(numeric = sfl(minimum = min, maximum = max, hist = NULL), append = TRUE)
test.df1_skim1 <- test.df1 %>%
group_by(Year) %>%
my_skim() %>%
filter(skim_type != "character") %>%
select(-starts_with("character"))
If you only want a summary of the numeric variables, you could set all the other types to NULL, or you could run the skim and use yank() to get the subtable for one type.
From https://docs.ropensci.org/skimr/articles/skimr.html#reshaping-the-results-from-skim-
skim(Orange) %>% yank("numeric")
The append option controls whether your custom statistics are added to the defaults (append = TRUE, the default) or replace them (append = FALSE).
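The effect of append is easiest to see by comparing the column names of the two results. A small sketch, assuming skimr 2.x is installed and that its numeric summary columns are prefixed with "numeric." (num_df and the skimmer names are illustrative):

```r
library(skimr)

num_df <- data.frame(Revenue = c(120, 150, 180),
                     Profit  = c(95, 128, 151))

# Custom statistic added on top of the default numeric summaries
skim_append  <- skim_with(numeric = sfl(minimum = min), append = TRUE)

# Custom statistic used instead of the default numeric summaries
skim_replace <- skim_with(numeric = sfl(minimum = min), append = FALSE)

res_append  <- skim_append(num_df)
res_replace <- skim_replace(num_df)

names(res_append)   # defaults such as numeric.mean plus numeric.minimum
names(res_replace)  # numeric.minimum without the default statistics
```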
I'd like to compare two data.frames, df1 and df2, element by element. From them, I'd like to build a new data.frame called out. If the elements are equal, the element in out is 1; otherwise it is 0.
MWE
set.seed(1)
df1 <- data.frame(Q1 = sample(letters[1:5], 2, replace = TRUE),
Q2 = sample(letters[1:5], 2, replace = TRUE))
set.seed(2)
df2 <- data.frame(Q1 = sample(letters[1:5], 2, replace = TRUE),
Q2 = sample(letters[1:5], 2, replace = TRUE))
Expected output
out <- data.frame(Q1 = c(0, 0), Q2 = c(1, 0))
If the data frames are created with stringsAsFactors = FALSE (factors make the comparison difficult because their level attributes can differ), you can compare them directly:
+(df1 == df2)
Or, if the columns are factors, convert them to character columns with type.convert:
+(type.convert(df1, as.is = TRUE) == type.convert(df2, as.is = TRUE))
Or use the matrix shortcut: as.matrix converts the columns to character before comparing.
+(as.matrix(df1) == as.matrix(df2))
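To make the factor case concrete, here is a small sketch with fixed values instead of sample(), so the result does not depend on the random number generator. The values are illustrative; the factor columns deliberately have different level sets, which is the situation where a direct == on the data frames typically fails:

```r
df1 <- data.frame(Q1 = c("a", "d"), Q2 = c("b", "e"),
                  stringsAsFactors = TRUE)
df2 <- data.frame(Q1 = c("e", "c"), Q2 = c("b", "a"),
                  stringsAsFactors = TRUE)

# as.matrix() turns the factor columns into a character matrix,
# so the comparison no longer depends on factor levels
same <- as.matrix(df1) == as.matrix(df2)

# Unary plus converts TRUE/FALSE to 1/0; wrap back into a data.frame
out <- as.data.frame(+same)
```

Only the first row of Q2 matches ("b" in both), so that is the only 1 in out.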
I know I can do this in other ways, but I am just curious.
dfDice = sample(1:6, 10000, replace = TRUE) %>%
data.frame()
The above creates a data.frame, where the column header is called '.'.
So my first question is can I pipe the column header into my code? I have tried putting it in my data.frame() function but it just creates a new column.
And my second question is, can I pipe multiple columns into a data.frame, or would I have to do something like this?:
dfDice = (num = sample(1:6, 10000, replace = TRUE)) %>%
data.frame(letters = sample(LETTERS, 10000, replace = TRUE))
Again, I know this is not the best way to create a data.frame, I am just curious from a learning perspective and trying to fully understand piping.
So my first question is can I pipe the column header into my code? I
have tried putting it in my data.frame() function but it just creates
a new column.
For single columns you have two options:
dfDice <- sample(1:6, 10000, replace = TRUE) %>%
data.frame() %>%
setNames("num")
dfDice <- sample(1:6, 10000, replace = TRUE) %>%
data.frame(num = .)
And my second question is, can I pipe multiple columns into a
data.frame?
sample(1:6, 5, replace = TRUE) %>%
cbind(sample(LETTERS, 5, replace = TRUE)) %>%
as.data.frame() %>%
setNames(c("num", "letters"))
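One caveat with the cbind() route is that combining numbers and letters produces a character matrix, so num ends up as text. A sketch that keeps the column types by using the dot placeholder as a named argument, assuming magrittr is installed (stringsAsFactors = FALSE is set explicitly so the result is the same on older R versions):

```r
library(magrittr)

dfDice <- sample(1:6, 5, replace = TRUE) %>%
  # The dot is a direct argument here, so the piped value is used
  # only as the num column and is not also inserted as first argument
  data.frame(num = .,
             letters = sample(LETTERS, 5, replace = TRUE),
             stringsAsFactors = FALSE)

str(dfDice)
```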
To assign names, you can use a predefined vector and use setNames
library(dplyr)
cols <- "a"
sample(1:6, 10, replace = TRUE) %>%
data.frame() %>%
setNames(cols)
Or you can assign names dynamically, without knowing the number of columns beforehand.
sample(1:6, 10, replace = TRUE) %>%
data.frame() %>%
setNames(letters[seq_along(.)])
For the second question, the simplest option would be
data.frame(a = sample(1:6, 10, replace = TRUE),
b = sample(LETTERS, 10, replace = TRUE))
Or, if you want to use piping:
sample(1:6, 10, replace = TRUE) %>%
data.frame() %>%
setNames(cols) %>%
bind_cols(b = sample(LETTERS, 10, replace = TRUE))
Suppose my data looks something like this (with many more columns)
set.seed(112116)
df <- data.frame(x1 = sample(c(LETTERS, -1:-10), 100, replace = T),
x2 = sample(c(letters, -1:-10), 100, replace = T),
x3 = sample(c(1:30, -1:-10), 100, replace = T))
I want to replace all negative numbers with NA. I can do it one by one like this:
df <- df %>% mutate(x1 = replace(x1, which(x1<0), NA),
x2 = replace(x2, which(x2<0), NA),
x3 = replace(x3, which(x3<0), NA))
But I'm hoping that there is a way of doing this for all columns in my data.
Try with mutate_each
df %>%
mutate_each(funs(replace(., .<0, NA)))
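Because x1 and x2 mix letters with negative numbers, they are stored as character (or factor) columns, and comparing them with < 0 relies on string ordering or fails for factors. A more defensive base R sketch, assuming that any value which parses as a negative number should become NA (neg_to_na is a hypothetical helper name, and the small data frame is illustrative):

```r
neg_to_na <- function(x) {
  # Parse values as numbers; non-numeric entries such as "A" become NA
  num <- suppressWarnings(as.numeric(as.character(x)))
  # Blank out only the entries that are genuinely negative numbers
  x[!is.na(num) & num < 0] <- NA
  x
}

df <- data.frame(x1 = c("A", "-1", "B"),
                 x3 = c(5L, -2L, 7L),
                 stringsAsFactors = FALSE)

# Apply to every column while keeping the data.frame structure
df[] <- lapply(df, neg_to_na)
```

The same helper could be passed to dplyr as mutate(across(everything(), neg_to_na)) if a pipeline is preferred.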