I have a dataframe with several numeric variables along with factors. I wish to run over the numeric variables and replace the negative values to missing. I couldn't do that.
My alternative idea was to write a function that gets a dataframe and a variable, and does it. It didn't work either.
My code is:
NegativeToMissing = function(df,var)
{
df$var[df$var < 0] = NA
}
Error in $<-.data.frame(`*tmp*`, "var", value = logical(0)) : replacement has 0 rows, data has 40
what am I doing wrong ?
Thank you.
Here is an example with some dummy data.
df1 <- data.frame(col1 = c(-1, 1, 2, 0, -3),
col2 = 1:5,
col3 = LETTERS[1:5])
df1
# col1 col2 col3
#1 -1 1 A
#2 1 2 B
#3 2 3 C
#4 0 4 D
#5 -3 5 E
Now find columns that are numeric
numeric_cols <- sapply(df1, is.numeric)
And replace negative values
df1[numeric_cols] <- lapply(df1[numeric_cols], function(x) replace(x, x < 0 , NA))
df1
# col1 col2 col3
#1 NA 1 A
#2 1 2 B
#3 2 3 C
#4 0 4 D
#5 NA 5 E
You could also do
df1[df1 < 0] <- NA
With tidyverse, we can make use of mutate_if
library(tidyverse)
df1 %>%
mutate_if(is.numeric, funs(replace(., . < 0, NA)))
If you still want to change only one selected variable a solution withdplyr would be to use non-standard evaluation:
library(dplyr)
NegativeToMissing <- function(df, var) {
quo_var = quo_name(var)
df %>%
mutate(!!quo_var := ifelse(!!var < 0, NA, !!var))
}
NegativeToMissing(data, var=quo(val1)) # use quo() function without ""
# val1 val2
# 1 1 1
# 2 NA 2
# 3 2 3
Data used:
data <- data.frame(val1 = c(1, -1, 2),
val2 = 1:3)
data
# val1 val2
# 1 1 1
# 2 -1 2
# 3 2 3
Related
I have these two dataframes (imagine them very big) :
df = data.frame(subjects = 1:10,
var1 = c('a',NA,'b',NA,'c',NA,'d','e','f','g'))
g = data.frame(subjects = c(1,3,5,7,8,9,10),
score = c(1,2,1,3,2,4,1) )
and I want to put the variable score from the g dataframe into the df dataframe, with the condition that if var1 = NA, then the score in df will be equal to NA. How can we make that with a simple function ? thanks.
Second scenario :
df = data.frame(subjects = 1:10,
var1 = c('a','e','b','c','c','b','d','e','f','g'))
g = data.frame(subjects = c(1,3,5,7,8,9,10),
score = c(1,2,1,3,2,4,1) )
now I want that the score for each subject that was not calculated to be NAs to become as follows :
df = data.frame(subjects = 1:10,
var1 = c('a','e','b','c','c','b','d','e','f','g'),
score = c(1,NA,2,NA,1,NA,3,2,4,1))
We could do a join by 'subjects' which return 'score' with NA where there are no corresponding 'subject's in 'g'. If we need the 'score' to be NA also when 'var1' is NA, do a replace on the next step with NA check on 'var1'
library(dplyr)
df <- left_join(df, g, by= "subjects") %>%
mutate(score = replace(score, is.na(var1), NA))
-output
df
subjects var1 score
1 1 a 1
2 2 e NA
3 3 b 2
4 4 c NA
5 5 c 1
6 6 b NA
7 7 d 3
8 8 e 2
9 9 f 4
10 10 g 1
This question is probably best illustrated with an example.
Suppose I have a dataframe df with a binary variable b (values of b are 0 or 1). How can I take a random sample of size 10 from this dataframe so that I have 2 instances where b=0 in the random sample, and 8 instances where b=1 in the dataframe?
Right now, I know that I can do df[sample(nrow(df),10,] to get part of the answer, but that would give me a random amount of 0 and 1 instances. How can I specify a specific amount of 0 and 1 instances while still taking a random sample?
Here's an example of how I'd do this... take two samples and combine them. I've written a simple function so you can "just take one sample."
With a vector:
pop <- sample(c(0,1), 100, replace = TRUE)
yoursample <- function(pop, n_zero, n_one){
c(sample(pop[pop == 0], n_zero),
sample(pop[pop == 1], n_one))
}
yoursample(pop, n_zero = 2, n_one = 8)
[1] 0 0 1 1 1 1 1 1 1 1
Or, if you are working with a dataframe with some unique index called id:
# Where d1 is your data you are summarizing with mean and sd
dat <- data.frame(
id = 1:100,
val = sample(c(0,1), 100, replace = TRUE),
d1 = runif(100))
yoursample <- function(dat, n_zero, n_one){
c(sample(dat[dat$val == 0,"id"], n_zero),
sample(dat[dat$val == 1,"id"], n_one))
}
sample_ids <- yoursample(dat, n_zero = 2, n_one = 8)
sample_ids
mean(dat[dat$id %in% sample_ids,"d1"])
sd(dat[dat$id %in% sample_ids,"d1"])
Here is a suggestion:
First create a sample of 0 and 1 with id column.
Then sample 2:8 df's with condition and bind them together:
library(tidyverse)
set.seed(123)
df <- as_tibble(sample(0:1,size=50,replace=TRUE)) %>%
mutate(id = row_number())
df1 <- df[ sample(which (df$value ==0) ,2), ]
df2 <- df[ sample(which (df$value ==1), 8), ]
df_final <- bind_rows(df1, df2)
value id
<int> <int>
1 0 14
2 0 36
3 1 21
4 1 24
5 1 2
6 1 50
7 1 49
8 1 41
9 1 28
10 1 33
library(tidyverse)
set.seed(123)
df <- data.frame(a = letters,
b = sample(c(0,1),26,T))
bind_rows(
df %>%
filter(b == 0) %>%
sample_n(2),
df %>%
filter(b == 1) %>%
sample_n(8)
) %>%
arrange(a)
a b
1 d 1
2 g 1
3 h 1
4 l 1
5 m 1
6 o 1
7 p 0
8 q 1
9 s 0
10 v 1
I have many dataframes that contain the same data, except for a few column differences between them that I want to remove. Here's something similar to what I have:
df1 <- data.frame(X = c(1, 2, 3, 4, 5),
var1 = c('a', 'b', 'c', 'd', 'e'),
var2 = c(1, 1, 0, 0, 1))
df2 <- data.frame(X..x = c(1, 2, 3, 4, 5),
X..y = c(1, 2, 3, 4, 5),
var1 = c('f', 'g', 'h', 'i', 'j'),
var2 = c(0, 1, 0, 1, 1))
df_list <- list(df1=df1,df2=df2)
I am trying to create a function to remove the X, X..x, and X..y columns from each of the dataframes. Here's what I've tried with the given error:
remove_col <- function(df){
df = subset(df, select = -c(X, X..x, X..y))
return(df)
}
df_list <- lapply(df_list, remove_col)
# Error in eval(substitute(select), nl, parent.frame()) :
# object 'X..x' not found
I'm running into problems because not all dataframes contain X, and similarly not all dataframes contain X..x and X..y. How can I update the function so that it can be applied to all dataframes in the list and successfully remove its given columns?
Using R version 3.5.1, Mac OS X 10.13.6
You can try:
#Function
remove_col <- function(df,name){
vec <- which(names(df) %in% name)
df = df[,-vec]
return(df)
}
df_list <- lapply(df_list, remove_col,name=c('X', 'X..x', 'X..y'))
$df1
var1 var2
1 a 1
2 b 1
3 c 0
4 d 0
5 e 1
$df2
var1 var2
1 f 0
2 g 1
3 h 0
4 i 1
5 j 1
if you want to keep only the columns with "var"
lapply(df_list, function(x) x[grepl("var",colnames(x))])
or if you really just want those removed explecitly
lapply(df_list, function(x) x[!grepl("^X$|^X\\.\\.x$|^X\\.\\.y$",colnames(x))])
$df1
var1 var2
1 a 1
2 b 1
3 c 0
4 d 0
5 e 1
$df2
var1 var2
1 f 0
2 g 1
3 h 0
4 i 1
5 j 1
Instead of checking each list element for the same column names, it can be automated if we can extract the intersecting column names across the list. Loop over the list, get the column names, find the intersecting elements with Reduce and use that to subset the columns
nm1 <- Reduce(intersect, lapply(df_list, names))
lapply(df_list, `[`, nm1)
#$df1
# var1 var2
#1 a 1
#2 b 1
#3 c 0
#4 d 0
#5 e 1
#$df2
# var1 var2
#1 f 0
#2 g 1
#3 h 0
#4 i 1
#5 j 1
Or with tidyverse
library(dplyr)
library(purrr)
map(df_list, names) %>%
reduce(intersect) %>%
map(df_list, select, .)
I have filtered data and one of the columns has 5 factor levels and I want to get sum for each of the factor level.
I am using the below code
levels(df_Temp$ATYPE)
[1] "a" "b" "c" "d" "Unknown"
I am using the below code
cast(df_Temp,ATYPE~AFTER_ADM, sum, value = "CHRGES")
but the output I am getting is as below
ATYPE 0 1
1 a 0 2368968.39
2 b 0 3206567.47
3 c 0 19551.19
4 e 0 2528688.12
I want to all the factor levels and sum as "0" for those missing data of factors level.
So the desired output is
ATYPE 0 1
1 a 0 2368968.39
2 b 0 3206567.47
3 c 0 19551.19
4 d 0 0
5 e 0 2528688.12
Using xtabs from base R
xtabs(CHRGES ~ ATYPE + AFTER_ADM, subset(df_Temp, ATYPE != "e"))
# AFTER_ADM
#ATYPE 0 1
# a 0.00000000 -5.92270971
# b -1.68910431 0.05222349
# c -0.26869311 0.16922669
# d 1.44764443 -1.59011411
# e 0.00000000 0.00000000
data
set.seed(24)
df_Temp <- data.frame(ATYPE = sample(letters[1:5], 20, replace = TRUE),
AFTER_ADM = sample(0:1, 20, replace = TRUE), CHRGES = rnorm(20))
If I understand your question correctly, you can use dplyr. First I created an example dataset:
set.seed(123)
x <- sample(letters[1:5], 1e3, replace = T)
x[x == "e"] <- "Unknown"
y <- sample(1:100, 1e3, replace = T)
df1 <- data.frame(ATYPE = factor(x), AFTER_ADM = y)
df1$AFTER_ADM[df1$ATYPE == "Unknown"] <- NA
head(df1, 10)
ATYPE AFTER_ADM
1 b 28
2 d 60
3 c 17
4 Unknown NA
5 Unknown NA
6 a 48
7 c 78
8 Unknown NA
9 c 7
10 c 45
And then use group_by and summarise to get the sum and the counts. I was not sure if you would want the counts for the factor levels but it is easy to take out if you are not interested:
library(dplyr)
df1 %>%
group_by(ATYPE) %>%
summarise(sum_AFTER_ADM = sum(AFTER_ADM, na.rm = T),
n_ATYPE = n())
# A tibble: 5 x 3
ATYPE sum_AFTER_ADM n_ATYPE
<fct> <int> <int>
1 a 10363 198
2 b 11226 206
3 c 9611 203
4 d 9483 195
5 Unknown 0 198
Another possible solution using dplyr and tidyr. Using count and complete from the two packages will help solve your problem.
library(dplyr)
library(tidyr)
#using iris as toy data
iris2 <- iris %>%
filter(Species != "setosa")
#count data and then fill n with 0
ir3 <- count(iris2, Species) %>%
complete(Species, fill = list(n =0))
I have a data frame with some NA values. I need the sum of two of the columns. If a value is NA, I need to treat it as zero.
a b c d
1 2 3 4
5 NA 7 8
Column e should be the sum of b and c:
e
5
7
I have tried a lot of things, and done two dozen searches with no luck. It seems like a simple problem. Any help would be appreciated!
dat$e <- rowSums(dat[,c("b", "c")], na.rm=TRUE)
dat
# a b c d e
# 1 1 2 3 4 5
# 2 5 NA 7 8 7
dplyr solution, taken from here:
library(dplyr)
dat %>%
rowwise() %>%
mutate(e = sum(b, c, na.rm = TRUE))
Here is another solution, with concatenated ifelse():
dat$e <- ifelse(is.na(dat$b) & is.na(dat$c), dat$e <-0, ifelse(is.na(dat$b), dat$e <- 0 + dat$c, dat$b + dat$c))
# a b c d e
#1 1 2 3 4 5
#2 5 NA 7 8 7
Edit, here is another solution that uses with as suggested by #kasterma in the comments, this is much more readable and straightforward:
dat$e <- with(dat, ifelse(is.na(b) & is.na(c ), 0, ifelse(is.na(b), 0 + c, b + c)))
if you want to keep NA if both columns has it you can use:
Data, sample:
dt <- data.table(x = sample(c(NA, 1, 2, 3), 100, replace = T), y = sample(c(NA, 1, 2, 3), 100, replace = T))
Solution:
dt[, z := ifelse(is.na(x) & is.na(y), NA_real_, rowSums(.SD, na.rm = T)), .SDcols = c("x", "y")]
(the data.table way)
I hope that it may help you
Some cases you have a few columns that are not numeric. This approach will serve you both.
Note that: c_across() for dplyr version 1.0.0 and later
df <- data.frame(
TEXT = c("text1", "text2"), a = c(1,5), b = c(2, NA), c = c(3,7), d = c(4,8))
df2 <- df %>%
rowwise() %>%
mutate(e = sum(c_across(a:d), na.rm = TRUE))
# A tibble: 2 x 6
# Rowwise:
# TEXT a b c d e
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 text1 1 2 3 4 10
# 2 text2 5 NA 7 8 20