How to drop observations based on conditions - r

I have a subset of a larger dataset that carries a total count (n) for each observation. When the same name appears more than once, I want to keep only the row with the highest count and drop the codes that appear less often. How would I go about that? For instance:
name = c("a", "a", "b", "b", "b", "c", "d", "e", "e", "e")
code = c(1,1,2,3,4,1,1,2,2,3)
n = c(1,10,2,3,5,4,8,100,90,40)
data = data.frame(name,code,n)
The end product would be left with these:
name = c("a", "b", "c", "d", "e")
code = c(1,4,1,1,2)
n = c(10,5,4,8,100)
data2 = data.frame(name,code,n)

If you can use dplyr, this should do the trick:
library(dplyr)
data %>%
  group_by(name) %>%
  filter(n == max(n)) %>%
  ungroup()
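If dplyr is not an option, the same idea can be sketched in base R with ave(), which repeats each group's maximum on every row of that group. This is only a sketch: it assumes tied maxima should all be kept (the same behaviour as filter(n == max(n))), and the name result is made up for illustration.

```r
# Rebuild the example data from the question.
name <- c("a", "a", "b", "b", "b", "c", "d", "e", "e", "e")
code <- c(1, 1, 2, 3, 4, 1, 1, 2, 2, 3)
n    <- c(1, 10, 2, 3, 5, 4, 8, 100, 90, 40)
data <- data.frame(name, code, n)

# ave() returns the per-name maximum of n, aligned row by row,
# so the comparison keeps the row(s) holding the largest count per name.
result <- data[data$n == ave(data$n, data$name, FUN = max), ]
rownames(result) <- NULL  # tidy up the row labels after subsetting
result
```

If two rows of the same name share the maximum count, both survive; an extra tie-breaking step would be needed to keep exactly one.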

Related

How to duplicate rows with non-consecutive dates in R

I need to duplicate rows with non-consecutive dates so that every date in the range is present in the data frame.
Suppose this df:
df <- data.frame(date = c("2022-07-05", "2022-07-07", "2022-07-11", "2022-07-15", "2022-07-18"), letter = c("a", "b", "a", "b", "c"))
The desired output is this df_new:
df_new <- data.frame(date = c("2022-07-05", "2022-07-06",
"2022-07-07", "2022-07-08", "2022-07-09", "2022-07-10",
"2022-07-11", "2022-07-12", "2022-07-13", "2022-07-14",
"2022-07-15"),
letter = c("a", "a",
"b", "b", "b", "b",
"a", "a", "a", "a",
"c"))
Could you please help?
We could use complete() from tidyr to expand the data to every day between the minimum and maximum date (a sequence with by = '1 day'), and then use fill() to replace the resulting NA values in 'letter' with the previous non-NA value.
library(dplyr)
library(tidyr)
df %>%
  mutate(date = as.Date(date)) %>%
  complete(date = seq(min(date), max(date), by = '1 day')) %>%
  fill(letter)
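For comparison, here is a rough base R sketch of the same expand-and-carry-forward idea, assuming the observed dates are sorted in ascending order. The names dates, all_dates, idx and df_filled are made up for illustration. Note that with the df given in the question, the filled range runs through 2022-07-18, the last date in the data, and "b" is carried forward from 2022-07-15.

```r
df <- data.frame(
  date   = c("2022-07-05", "2022-07-07", "2022-07-11", "2022-07-15", "2022-07-18"),
  letter = c("a", "b", "a", "b", "c")
)

dates     <- as.Date(df$date)
all_dates <- seq(min(dates), max(dates), by = "day")

# For each calendar day, findInterval() gives the index of the latest
# observed date on or before it (requires dates sorted ascending).
idx <- findInterval(as.numeric(all_dates), as.numeric(dates))

# Carry the letter of that latest observed date forward to each day.
df_filled <- data.frame(date   = all_dates,
                        letter = as.character(df$letter)[idx])
df_filled
```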

How to summarise the count of equal rows between factor columns in R?

I have the following example data:
df <- tibble(var1 = factor(c("a", "a", "a", "b", "b", "c", "c", "c")),
var2 = factor(c("a", "b", "b", "c", "c", "c", "d", "d")))
I would like to summarise this data as a tibble with one row and three columns (1x3): the first column holds the number of rows where the two values are equal, the second the number of rows where they differ, and the third the total number of rows. The final format would be:
final <- tibble(equal = 2, different = 6, total = 8)
Thank you all.
You could use
library(dplyr)
df %>%
  summarise(equal = sum(as.numeric(var1) == as.numeric(var2)),
            different = sum(as.numeric(var1) != as.numeric(var2)),
            total = n())
This returns
# A tibble: 1 x 3
  equal different total
  <int>     <int> <int>
1     2         6     8
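One caveat worth spelling out: as.numeric() on a factor returns the internal level codes, so the comparison above only matches labels when the levels of both columns line up (here "a", "b", "c" get the same codes in var1 and var2). A possibly safer sketch compares the labels themselves via as.character(); the vectors x and y below are hypothetical and only illustrate the pitfall.

```r
library(dplyr)

df <- tibble(var1 = factor(c("a", "a", "a", "b", "b", "c", "c", "c")),
             var2 = factor(c("a", "b", "b", "c", "c", "c", "d", "d")))

# Compare the labels rather than the integer level codes.
res <- df %>%
  summarise(equal     = sum(as.character(var1) == as.character(var2)),
            different = sum(as.character(var1) != as.character(var2)),
            total     = n())
res

# Hypothetical illustration of why level codes can mislead:
x <- factor(c("a", "b"))  # levels a, b -> codes 1, 2
y <- factor(c("b", "c"))  # levels b, c -> codes 1, 2
sum(as.numeric(x) == as.numeric(y))      # 2, although no labels match
sum(as.character(x) == as.character(y))  # 0
```

For the example data both approaches give equal = 2, different = 6, total = 8.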

Replace values in vector where not %in% vector

Short question:
I can substitute certain variable values like this:
values <- c("a", "b", "a", "b", "c", "a", "b")
df <- data.frame(values)
What's the easiest way to replace every value of df$values that is neither "a" nor "b" with "x"?
Output should be:
c("a", "b", "a", "b", "x", "a", "b")
Your example is a bit unclear and not reproducible.
However, guessing at what you actually want, you could try the data.table package (note the negation, and that df has to be converted to a data.table first):
library(data.table)
setDT(df)
df[!values %in% c("a", "b"), values := "x"]
or the dplyr package:
library(dplyr)
df %>% mutate(values = ifelse(values %in% c("a", "b"), values, "x"))
What about:
df[!df[, 1] %in% c("a", "b"), ] <- "x"
  values
1      a
2      b
3      a
4      b
5      x
6      a
7      b

Conditional Calculation of Columns in R

I have a dataset like the one shown below
library(tidyverse)
dat <- data.frame(col.1 = 1:16,
                  col.2 = c("B", "B", "B", "B", "B", "B", "A", "B",
                            "A", "A", "B", "A", "A", "A", "A", "A"),
                  col.3 = c(30, 60, 75, 105, 40, 80, -20, 60, -20, -60, 40,
                            -40, -105, -20, -20, -45),
                  col.4 = c(39.34775, 31.66806, 28.57107, 28.43085, 29.30417, 36.21187,
                            40.29794, 40.70641, 65.85152, 66.85943, 69.26766, 67.24402,
                            74.85330, 79.17230, 78.75405, 64.47038))
dat
I'm trying to reach the final column which looks like this:
dat.2 <- dat %>%
  mutate(col.Final = c(1180.43, 1900.08, 2142.83, 2985.24, 1172.17,
                       2896.95, -629.63, 2442.38, -655.37, -1966.11,
                       2770.71, -1460.48, -3833.76, -730.24, -730.24,
                       -1643.04))
So far, I have tried using mutate() function to reach this point.
dat.1 <- dat %>%
  mutate(col.5 = col.3 * col.4) %>%
  mutate(col.6 = cumsum(col.3)) %>%
  mutate(col.7 = if_else(col.2 == 'B', col.6, col.6 - col.3),
         col.8 = col.3 / col.7)
When I'm trying to reach the final column I'm not getting the same results.
dat.1 %>%
  mutate(col.9 = if_else(col.2 == 'A', col.8 * lag(cumsum(col.5)), col.5))
Note: the same calculation was done successfully using Excel's SUMIFS() function, and I'm trying to get the same results in R.
I have seen some of the Q&A for similar posts but am still stuck on the final calculation. In Excel, it felt as if the calculation iterated: one condition was evaluated, and then the next one was executed on its result. I'm not sure exactly what Excel did, but I think this should be possible in R as well; I just can't figure out how.
Any help would be appreciated.
Update:
Values for col.5 and col.8 corresponding to col.2:
col.2 = c("B", "B", "B", "B", "B", "B", "A", "B",
"A", "A", "B", "A", "A", "A", "A", "A")
col.5 <- c(1180.4325, 1900.0836, 2142.8302, 2985.2393, 1172.1668,
2896.9496, -805.9588, 2442.3846, -1317.0304, -4011.5658,
2770.7064, -2689.7608, -7859.5965, -1583.4460, -1575.0810,
-2901.1671)
col.8 <-c(1.00000000, 0.66666667, 0.45454545, 0.38888889, 0.12903226,
0.20512821, -0.05128205, 0.13953488, -0.04651163, -0.14634146,
0.10256410,-0.10256410, -0.30000000, -0.08163265, -0.08888889,
-0.21951220)
Verifying the values by hand, using col.5 and col.8:
For the "B" rows from the top, sum col.5:
1180.43 + 1900.08 + 2142.83 + 2985.24 + 1172.17 + 2896.95 = 12277.7020
For the first "A" row, multiply the running total by col.8:
12277.7020 x -0.05128205 = -629.6266509 ... the 1st desired value for A
For the next "B" row, add the A result to the running total, then add that row's col.5:
12277.7020 - 629.6266509 = 11648.07535
11648.07535 + 2442.3846 = 14090.45995
For the next "A" row:
14090.45995 x -0.04651163 = -655.37026 ... the 2nd desired value for A
For the "A" row after that, update the running total first:
14090.45995 - 655.37026 = 13435.08969
13435.08969 x -0.14634146 = -1966.110641 ... the 3rd desired value for A
and so on.
I hope this explains it.
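The hand calculation above is recursive: each "A" value depends on a running total that already includes the earlier "A" values. That is probably why the lag(cumsum(col.5)) attempt does not match, since it accumulates the raw col.5 for "A" rows instead of the adjusted values. A minimal base R sketch of the iteration, assuming the rule is exactly the one worked through by hand (the names running and col.Final are only illustrative):

```r
dat <- data.frame(
  col.1 = 1:16,
  col.2 = c("B", "B", "B", "B", "B", "B", "A", "B",
            "A", "A", "B", "A", "A", "A", "A", "A"),
  col.3 = c(30, 60, 75, 105, 40, 80, -20, 60, -20, -60, 40,
            -40, -105, -20, -20, -45),
  col.4 = c(39.34775, 31.66806, 28.57107, 28.43085, 29.30417, 36.21187,
            40.29794, 40.70641, 65.85152, 66.85943, 69.26766, 67.24402,
            74.85330, 79.17230, 78.75405, 64.47038)
)

# Intermediate columns, computed as in the question.
col.5 <- dat$col.3 * dat$col.4
col.6 <- cumsum(dat$col.3)
col.7 <- ifelse(dat$col.2 == "B", col.6, col.6 - dat$col.3)
col.8 <- dat$col.3 / col.7

# Walk down the rows, keeping a running total of the final values so far.
col.Final <- numeric(nrow(dat))
running   <- 0
for (i in seq_len(nrow(dat))) {
  col.Final[i] <- if (dat$col.2[i] == "B") {
    col.5[i]             # "B": take col.3 * col.4 as is
  } else {
    col.8[i] * running   # "A": col.8 times the running total before this row
  }
  running <- running + col.Final[i]
}

dat$col.Final <- round(col.Final, 2)
dat
```

A loop keeps the dependency on the previous result explicit; the same recursion could presumably be written with purrr::accumulate() or Reduce(), but that is a matter of taste.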

create long list of variables based on existing variables

I have a long list of variables and for each I want to create a dummy variable. I am using the below dplyr mutate code to do this, but know that something like an array in SAS could be used (so I don't have to copy this line out multiple times). I just haven't been able to find an answer on Stack or anywhere else that fits.
Grade_Dist2 <- Grade_Dist2 %>% mutate(
  ACCT2301_FA15_z = ifelse(ACCT2301_FA15 %in% c("A", "B", "C"), 1,
                    ifelse(ACCT2301_FA15 %in% c("D", "F", "W", "Q"), 0, NA)))
The columns/vars are arranged together--all vars in the table are similar except an ID var.
In the tidyverse you should probably look at something like mutate_all(), but in the meantime I would think something like this base R solution would work:
all_names <- grep("FA[0-9]+", names(Grade_Dist2), value = TRUE)
for (id in all_names) {
  cur_var <- Grade_Dist2[[id]]
  Grade_Dist2[[paste0(id, "_z")]] <-
    ifelse(cur_var %in% c("A", "B", "C"), 1,
           ifelse(cur_var %in% c("D", "F", "W", "Q"), 0, NA))
}
Here's a try at using a tidyverse approach with mutate_all as suggested by @BenBolker.
library(tidyverse)
Grade_Dist2 <- tibble(ACCT2301_FA15_z = c("A", "F", "C", "Z"))
Grade_Dist2 <- Grade_Dist2 %>%
  mutate_all(., funs(if_else(. %in% c("A", "B", "C"), 1,
                     if_else(. %in% c("D", "F", "W", "Q"), 0, NA_real_))))
Grade_Dist2
#> # A tibble: 4 x 1
#>   ACCT2301_FA15_z
#>             <dbl>
#> 1               1
#> 2               0
#> 3               1
#> 4              NA
If you want to append the dummy variables to the existing data instead of overwriting, then
mutate_all(., funs("dummy" = if_else(. %in% c("A", "B", "C"), 1,
                             if_else(. %in% c("D", "F", "W", "Q"), 0, NA_real_))))
will append variables with names like ACCT2301_FA15_z_dummy (or a single variable called dummy if only one variable is being mutated).
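In more recent dplyr releases funs() is deprecated and mutate_all() is superseded by across(), so a current version of the same idea might look like the sketch below. It assumes dplyr 1.0.0 or later; the helper to_dummy, the ID column and the second course column MATH1301_FA15 are hypothetical and only there to make the example self-contained.

```r
library(dplyr)

# Hypothetical data: an ID column plus two grade columns.
Grade_Dist2 <- tibble(ID            = 1:4,
                      ACCT2301_FA15 = c("A", "F", "C", "Z"),
                      MATH1301_FA15 = c("B", "W", NA, "D"))

# Helper (illustrative name): 1 for passing grades, 0 for the rest, NA otherwise.
to_dummy <- function(x) {
  case_when(x %in% c("A", "B", "C")      ~ 1,
            x %in% c("D", "F", "W", "Q") ~ 0,
            TRUE                         ~ NA_real_)
}

# Apply the helper to every column whose name ends in FA<digits>,
# appending the result as <column>_z and leaving the originals untouched.
Grade_Dist2 <- Grade_Dist2 %>%
  mutate(across(matches("FA[0-9]+$"), to_dummy, .names = "{.col}_z"))
Grade_Dist2
```

Selecting by name pattern keeps the ID column out of the transformation, which mutate_all() would not do on its own.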

Resources