Removing a row based on a condition - r

my_df <- tibble(
b1 = c(2, 1, 1, 2, 2, 2, 1, 1, 2),
b2 = c(NA, 4, 6, 2, 6, 6, 1, 1, 7),
b3 = c(5, 9, 8, NA, 2, 3, 9, 5, NA),
b4 = c(NA, 6, NA, 10, 12, 8, 3, 6, 2),
b5 = c(2, 12, 1, 7, 8, 5, 5, 6, NA),
b6 = c(9, 2, 4, 6, 7, 6, 6, 7, 9),
b7 = c(1, 3, 7, 7, 4, 2, 2, 9, 5),
b8 = c(NA, 8, 4, 5, 1, 4, 1, 3, 6),
b9 = c(4, 5, NA, 9, 5, 1, 1, 2, NA),
b10 = c(14, 3, NA, 2, 2, 2, 3, NA, 5))
I have a df like this, and would like to tell R to remove all '3' or 'NA' in b10 if b1 = 1. I have tried this with this, but it seems to keep the '3' and 'NA' instead of removing them;
new_df <- my_df %>% filter(is.na(b10) | b10 == 3 | b1==1 & b10 ==NA)

This should do the trick:
library(dplyr)
my_df %>%
filter(!(b1 == 1 & (is.na(b10) | b10 == 3)))
Edit: assuming you want to remove the rows where the conditions meet

Related

variable based on other variables in R

I have a df like this
my_df <- data.frame(
b1 = c(2, 6, 3, 6, 4, 2, 1, 9, NA),
b2 = c(100, 4, 106, 102, 6, 6, 1, 1, 7),
b3 = c(75, 79, 8, 0, 2, 3, 9, 5, 80),
b4 = c(NA, 6, NA, 10, 12, 8, 3, 6, 2),
b5 = c(2, 12, 1, 7, 8, 5, 5, 6, NA),
b6 = c(9, 2, 4, 6, 7, 6, 6, 7, 9),
b7 = c(1, 3, 7, 7, 4, 2, 2, 9, 5),
b8 = c(NA, 8, 4, 5, 1, 4, 1, 3, 6),
b9 = c(4, 5, 7, 9, 5, 1, 1, 2, NA),
b10 = c(14, 2, 4, 2, 1, 1, 1, 1, 5))
I want to create a new column (NEW) which says BLUE or RED based on columns b2 and b3. so, if column b2 is Greater than or equal to 100 0R b3 is Greater than or equal to 75, then input BLUE otherwise input RED.
So that I will have something like this:
my_df <- data.frame(
b1 = c(2, 6, 3, 6, 4, 2, 1, 9, NA),
b2 = c(100, 4, 106, 102, 6, 6, 1, 1, 7),
b3 = c(75, 79, 8, 0, 2, 3, 9, 5, 80),
b4 = c(NA, 6, NA, 10, 12, 8, 3, 6, 2),
b5 = c(2, 12, 1, 7, 8, 5, 5, 6, NA),
b6 = c(9, 2, 4, 6, 7, 6, 6, 7, 9),
b7 = c(1, 3, 7, 7, 4, 2, 2, 9, 5),
b8 = c(NA, 8, 4, 5, 1, 4, 1, 3, 6),
b9 = c(4, 5, 7, 9, 5, 1, 1, 2, NA),
b10 = c(14, 2, 4, 2, 1, 1, 1, 1, 5),
NEW = c("BLUE", "BLUE", "BLUE", "BLUE", "RED", "RED", "RED", "RED", "BLUE"))
I have been able to work this out using this:
library (tidyverse)
greater_threshold <- 99.9
greater_threshold1 <- 74.9
my_df1 <- my_df %>%
mutate(NEW = case_when(b2 > greater_threshold ~ "BLUE",
b3 > greater_threshold1 ~ "BLUE",
+ T~"RED"))
At the moment, you can see that I am setting my 'greater threshold' to be slightly less than the required value. Although it works well. My question is this. Is there a way I set set my 'greater threshold to be ≥ 100 for b2 and ≥ 75 for b3.
For this example, I'd go whit if_else instead of case_when:
library(dplyr)
greater_threshold <- 100
greater_threshold1 <- 75
my_df <- data.frame(
b1 = c(2, 6, 3, 6, 4, 2, 1, 9, NA),
b2 = c(100, 4, 106, 102, 6, 6, 1, 1, 7),
b3 = c(75, 79, 8, 0, 2, 3, 9, 5, 80),
b4 = c(NA, 6, NA, 10, 12, 8, 3, 6, 2),
b5 = c(2, 12, 1, 7, 8, 5, 5, 6, NA),
b6 = c(9, 2, 4, 6, 7, 6, 6, 7, 9),
b7 = c(1, 3, 7, 7, 4, 2, 2, 9, 5),
b8 = c(NA, 8, 4, 5, 1, 4, 1, 3, 6),
b9 = c(4, 5, 7, 9, 5, 1, 1, 2, NA),
b10 = c(14, 2, 4, 2, 1, 1, 1, 1, 5)
)
my_df1 <- my_df %>%
mutate(
NEW = if_else(
b2 >= greater_threshold | b3 >= greater_threshold1,
"BLUE",
"RED"
)
)
my_df1
# b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 NEW
# 1 2 100 75 NA 2 9 1 NA 4 14 BLUE
# 2 6 4 79 6 12 2 3 8 5 2 BLUE
# 3 3 106 8 NA 1 4 7 4 7 4 BLUE
# 4 6 102 0 10 7 6 7 5 9 2 BLUE
# 5 4 6 2 12 8 7 4 1 5 1 RED
# 6 2 6 3 8 5 6 2 4 1 1 RED
# 7 1 1 9 3 5 6 2 1 1 1 RED
# 8 9 1 5 6 6 7 9 3 2 1 RED
# 9 NA 7 80 2 NA 9 5 6 NA 5 BLUE

How to sort each column of a df in descending order regarless of the row order?

I am trying to sort my data in descending or ascending order regardless of the data in the rows. I made a dummy example below:
A <- c(9,9,5,4,6,3,2,NA)
B <- c(9,5,3,4,1,4,NA,NA)
C <- c(1,4,5,6,7,4,2,4)
base <- data.frame(A,B,C)
df <- base
df$A <- sort(df$A,na.last = T)
df$B <- sort(df$B,na.last = T)
df$C <- sort(df$C)
We get this
structure(list(A = c(2, 3, 3, 4, 4, 4, 5, 5, 6, 9, 9, NA), B = c(1,
2, 3, 4, 4, 4, 5, 5, 9, 10, NA, NA), C = c(1, 2, 3, 4, 4, 4,
5, 5, 6, 7, 8, 8)), row.names = c(NA, -12L), class = "data.frame")
I want to get something similar to df but my data have hundreds of columns, is there an easier way to do it?
I tried arrange_all() but the result is not what i want.
library(tidyverse)
test <- base%>%
arrange_all()
Obtaining this:
structure(list(A = c(2, 3, 3, 4, 4, 4, 5, 5, 6, 9, 9, NA), B = c(NA,
2, 4, 4, 5, 10, 3, 4, 1, 5, 9, NA), C = c(2, 3, 4, 6, 8, 5, 5,
8, 7, 4, 1, 4)), class = "data.frame", row.names = c(NA, -12L
))
You can sort each column individually :
library(dplyr)
base %>% mutate(across(.fns = sort, na.last = TRUE))
# A B C
#1 2 1 1
#2 3 3 2
#3 4 4 4
#4 5 4 4
#5 6 5 4
#6 9 9 5
#7 9 NA 6
#8 NA NA 7
Or in base R :
base[] <- lapply(base, sort, na.last = TRUE)

Using mean imputation for different variables

I have a data set where data are missing. Here is a sample of what my data looks like:
df<-read.csv(id, test1, test2, test3
1, 9, 1, 3
2, 8, 2, NA
3, NA, 3, NA
4, 1, 3, 4
5, 2, 44, NA
6, 4, 4, 1
7, NA, NA, NA)
How would I input the respective mean of each test into the corresponding column for each NA?
Output should look like
id test1 test2 test3
1, 9, 1, 3
2, 8, 2, 2.66
3, 4.8, 3, 2.66
4, 1, 3, 4
5, 2, 44, 2.66
6, 4, 4, 1
7, 4.8, 9.5, 2.66
An option would be na.aggregate
library(zoo)
df[-1] <- na.aggregate(df[-1])

Summarise grouping by varname

I'm analysing some data, and I've come a difficulty - I don't know how to summarise my whole dataset using the variable names as the "group".
A sample data:
structure(list(x4 = c(3, 4, 7, 4, 5, 1, 5, 2, 7, 1), x5 = c(2,
4, 4, 4, 5, 3, 6, 1, 7, 1), x6 = c(3, 5, 4, 7, 5, 4, 6, 4, 6,
2), x7 = c(4, 1, 6, 4, 6, 4, 6, 2, 7, 2), x9 = c(5, 5, 4, 5,
6, 3, 7, 5, 6, 1), x10 = c(3, 6, 5, 4, 6, 5, 6, 3, 6, 1), x11 = c(6,
7, 7, 7, 6, 7, 7, 5, 7, 4), x12 = c(6, 7, 6, 7, 6, 4, 6, 6, 7,
5), x14 = c(5, 7, 5, 6, 4, 6, 6, 5, 6, 4), x15 = c(4, 7, 7, 7,
6, 4, 6, 5, 6, 1), x16 = c(4, 7, 7, 7, 6, 5, 7, 3, 6, 4), x17 = c(4,
5, 5, 7, 6, 6, 7, 4, 6, 2), x18 = c(3, 4, 7, 7, 6, 5, 6, 4, 6,
2), x19 = c(5, 7, 5, 7, 6, 6, 6, 3, 6, 1), x22 = c(4, 4, 5, 7,
6, 7, 6, 5, 6, 2), x26 = c(6, 7, 5, 4, 6, 7, 7, 4, 6, 4), x29 = c(4,
7, 2, 7, 6, 4, 7, 3, 6, 1), x33 = c(3, 7, 7, 7, 6, 5, 6, 3, 6,
3), x34 = c(5, 5, 4, 7, 6, 7, 7, 5, 6, 2), x35 = c(4, 4, 7, 7,
5, 7, 6, 4, 6, 2), x36 = c(4, 7, 6, 7, 6, 5, 5, 4, 6, 2), x37 = c(3,
4, 7, 4, 5, 4, 6, 3, 5, 2), x49 = c(4, 7, 7, 7, 6, 5, 5, 6, 6,
3), x50 = c(4, 7, 6, 5, 5, 5, 6, 5, 7, 4)), row.names = c(NA,
-10L), class = "data.frame", .Names = c("x4", "x5", "x6", "x7",
"x9", "x10", "x11", "x12", "x14", "x15", "x16", "x17", "x18",
"x19", "x22", "x26", "x29", "x33", "x34", "x35", "x36", "x37",
"x49", "x50"))
I just want some statistics, like this:
summary <- dados_afc %>%
summarise_all(funs(mean, sd, mode, median))
But the result is a df with one observation and lots of variable. I wanted it to have 5 columns: varname, mean, sd, mode, median, but I'm not sure how to do it. Any tips?
Note: I am not aware of a built-in way to get mode from R. See here for some discussion:
Is there a built-in function for finding the mode?
# From the top answer there:
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
To treat each column as a group, you can use tidyr::gather to convert your "wide" data into long form, and then dplyr::group_by to create groups with their own summary calculations:
library(tidyverse)
summary <- dados_afc %>%
gather(group, value) %>%
group_by(group) %>%
summarise_all(funs(mean, sd, Mode, median))
> summary
# A tibble: 24 x 5
group mean sd Mode median
<chr> <dbl> <dbl> <dbl> <dbl>
1 x10 4.5 1.72 6 5
2 x11 6.3 1.06 7 7
3 x12 6 0.943 6 6
4 x14 5.4 0.966 6 5.5
5 x15 5.3 1.89 7 6
6 x16 5.6 1.51 7 6
7 x17 5.2 1.55 6 5.5
8 x18 5 1.70 6 5.5
9 x19 5.2 1.87 6 6
10 x22 5.2 1.55 6 5.5

Reorganize a List with Lists in a new list in R

This is my list:
mylist=list(list(a = c(2, 3, 4, 5), b = c(3, 4, 5, 5), c = c(3, 7, 5,
5), d = c(3, 4, 9, 5), e = c(3, 4, 5, 9), f = c(3, 4, 1, 9),
g = c(3, 1, 5, 9), h = c(3, 3, 5, 9), i = c(3, 17, 3, 9),
j = c(3, 17, 3, 9)), list(a = c(2, 5, 48, 4), b = c(7, 4,
5, 5), c = c(3, 7, 35, 5), d = c(3, 843, 9, 5), e = c(3, 43,
5, 9), f = c(3, 4, 31, 39), g = c(3, 1, 5, 9), h = c(3, 3, 5,
9), i = c(3, 17, 3, 9), j = c(3, 17, 3, 9)), list(a = c(2, 3,
4, 35), b = c(3, 34, 5, 5), c = c(3, 37, 5, 5), d = c(38, 4,
39, 5), e = c(3, 34, 5, 9), f = c(33, 4, 1, 9), g = c(3, 1, 5,
9), h = c(3, 3, 35, 9), i = c(3, 17, 33, 9), j = c(3, 137, 3,
9)), list(a = c(23, 3, 4, 85), b = c(3, 4, 53, 5), c = c(3, 7,
5, 5), d = c(3, 4, 9, 5), e = c(3, 4, 5, 9), f = c(3, 34, 1,
9), g = c(38, 1, 5, 9), h = c(3, 3, 5, 9), i = c(3, 137, 3, 9
), j = c(3, 17, 3, 9)), list(a = c(2, 3, 48, 5), b = c(3, 4,
5, 53), c = c(3, 73, 53, 5), d = c(3, 43, 9, 5), e = c(33, 4,
5, 9), f = c(33, 4, 13, 9), g = c(3, 81, 5, 9), h = c(3, 3, 5,
9), i = c(3, 137, 3, 9), j = c(3, 173, 3, 9)))
As you can see my list has 5 entries. Each entry has 10 others entries filled by 4 elements.
> mylist[[4]][[1]]
[1] 23 3 4 85
I want to create another list with only one entry.
All want to put all entr of tipe mylist[[i]][[1]] in first position of a new list: mynewlist[[1]][[1]] will be filled by the mylist[[1]][[1]],mylist[[2]][[1]],mylist[[3]][[1]],mylist[[4]][[1]],mylist[[5]][[1]] elements.
The secon position of mynewlist (mynewlist[[2]][[1]]) will be: mylist[[1]][[2]],mylist[[2]][[2]],mylist[[3]][[2]],mylist[[4]][[2]],mylist[[5]][[2]] elements.
Until
The fith position of mynewlist (mynewlist[[5]][[1]]) will be: mylist[[1]][[5]],mylist[[2]][[5]],mylist[[3]][[5]],mylist[[4]][[5]],mylist[[5]][[5]] elements.
In other words, I want to put every mylist[[i]][[1]]$a in the mynewlist[[1]][[1]] position; the mylist[[i]][[1]]$b in the mynewlist[[1]][[2]] position and so on until mylist[[i]][[1]]$j in the mynewlist[[1]][[10]]
This should be my output for the first position of mynewlist:
#[[1]]
#[1] 2 3 4 5
2 5 48 4
2 3 4 35
23 3 4 85
2 3 48 5
Any help?
We can use transpose
library(dplyr)
out <- mylist %>%
purrr::transpose(.)
out[[1]]
#[[1]]
#[1] 2 3 4 5
#[[2]]
#[1] 2 5 48 4
#[[3]]
#[1] 2 3 4 35
#[[4]]
#[1] 23 3 4 85
#[[5]]
#[1] 2 3 48 5

Resources