R function to find the sum of c given the column values

I want to create a function that I can simulate n times. My ultimate goal is to find the sum of c for each of the n simulations. I am a beginner in R, so I am just starting to practice with for loops and if/else statements.
This is what I hope to achieve as of now: if a > b, c should be 2, and if a < b, c should be -2. If a == b, c should be determined by the a and b values of the NEXT row. This is what I have so far, but I keep getting errors. I would like to know whether my approach to the a == b case is sound. Any help is appreciated.
a <- c(5,6,7,8,9,10,1,4,6,7)
b <- c(4,6,8,5,3,4,5,2,1,3)
c <- c(0,0,0,0,0,0,0,0,0,0)
df <- data.frame(a, b, c)
if(df$a > df$b){
  df$c <- c(2)}
else if(df$a < df$b){
  df$c <- c(-2)}
else if(df$a == df$b){ # a = b
  if(df$a[+1,] > df$b[+1,]) {
    df$c <- c(2)}
  else(df$a[+1,] < df$b[+1,]){
    df$c <- c(-2) }
}
else
  print("error")
}
sum(df$c)

The problem
if() and else() in R are meant for control flow and are not vectorized. In plain English, this means that if() expects a statement evaluating to a single TRUE or FALSE. When you do df$a > df$b you get a logical vector with as many elements as your data frame has rows. When this happens, if() will only use the first element and give you a warning. This will give you the wrong answers.
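A minimal reproduction of the problem (note that since R 4.2 a condition of length greater than one is an error rather than a warning):
a <- c(5, 6, 7)
b <- c(4, 6, 8)
# if() only looks at the first element of the logical vector a > b
if (a > b) print("first element wins")
#> [1] "first element wins"
#> Warning message:
#> In if (a > b) print("first element wins") :
#>   the condition has length > 1 and only the first element will be used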
A better solution
I think you are looking for ifelse() which is vectorized. And since you have nested if-else statements you are probably better off with dplyr::case_when().
Here is an example which also fixes cases where a == b for multiple rows:
# Note that I've added two consecutive rows where a == b
a <- c(5,6,6,7,8,9,10,1,4,6,7)
b <- c(4,6,6,8,5,3,4,5,2,1,3)
df <- data.frame(a, b)
library(dplyr)
df %>%
  mutate(
    c = case_when(
      a > b ~ 2,
      a < b ~ -2,
      # If neither a > b nor a < b is TRUE, they must be equal,
      # so we set all remaining cases to NA...
      TRUE ~ NA_real_
    )
  ) %>%
  # ... and then we use fill() to replace each NA with the first
  # non-NA value after it
  tidyr::fill(c, .direction = "up")
#>     a b  c
#> 1   5 4  2
#> 2   6 6 -2
#> 3   6 6 -2
#> 4   7 8 -2
#> 5   8 5  2
#> 6   9 3  2
#> 7  10 4  2
#> 8   1 5 -2
#> 9   4 2  2
#> 10  6 1  2
#> 11  7 3  2
Created on 2022-03-30 by the reprex package (v2.0.1)
How this works:
ifelse() works like if() and else() in your code, but it accepts vectors; see the short illustration after this list
case_when() acts like nested ifelse() statements: it first checks whether a > b and sets those values to 2, then it checks the remaining rows for a < b and sets those to -2, and so on.
In cases where a is neither greater than nor less than b, the two must be equal. We set these cases to NA.
Afterwards we use tidyr::fill() to replace each missing value with the first non-missing value after it. This handles cases where there are multiple consecutive rows with a == b.
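As a minimal illustration of that vectorized behaviour on the question's original vectors (note how the a == b pair in row 2 falls into the "no" branch and wrongly gets -2, which is exactly why the answer above parks those rows at NA and fills them afterwards):
a <- c(5,6,7,8,9,10,1,4,6,7)
b <- c(4,6,8,5,3,4,5,2,1,3)
ifelse(a > b, 2, -2) # one result per element
#> [1]  2 -2 -2  2  2  2 -2  2  2  2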

Edit: two users have already pointed out what to do if there are consecutive rows with a == b. A good opportunity to dive into the tidyverse (as already suggested by others):
library(dplyr)
library(tidyr)
df <- data.frame(
  a = c(5,6,7,8,9,10,1,4,6,7),
  b = c(4,6,8,5,3,4,5,2,1,3)
)
df %>%
  mutate(c = ifelse(a == b, NA, 2 * sign(a - b))) %>% ## (1)
  fill(c, .direction = 'up')                          ## (2)
(1) set c to NA when a == b
(2) fill() (replace) each NA with the next available value down the rows
Starting out with R, it's helpful to know that vectorizing (the x[n] thing) usually makes your code more concise and, in certain situations, much faster than using loops. In your case:
df$c <- 2 * sign(df$a - df$b)  ## see ?sign
z <- df$c == 0                 ## see (1)
df$c[z] <- lead(df$c, 1)[z]    ## see (2); lead() comes from dplyr
(1) equal numbers have sign zero, so z is a logical vector indicating the positions (rows) where a == b (i.e. where z is TRUE)
(2) change c only at the positions where z is TRUE. lead and lag are dplyr functions that take a vector and return a copy of it shifted by a given number of positions.
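A quick illustration of lead() and lag() (both from dplyr):
library(dplyr)
lead(c(10, 20, 30)) # shift one position towards the front
#> [1] 20 30 NA
lag(c(10, 20, 30))  # shift one position towards the back
#> [1] NA 10 20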

Here is a tidyverse solution. This will also work with multiple equal a and b values in a row (I have added row 3 to the data to demonstrate).
It relies on cumsum() to group the data such that rows with a == b fall in the same group as the next row where a != b; the quick check below shows the resulting group numbers. It then sets c to the last value in each group.
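For the example data, the grouping expression evaluates as follows; note how rows 2 and 3 (where a == b) share group 8 with row 4, the next row where a != b:
a <- c(5,6,5,7,8,9,10,1,4,6,7)
b <- c(4,6,5,8,5,3,4,5,2,1,3)
rev(cumsum(rev(a != b))) # backward cumulative sum of the a != b indicator
#> [1] 9 8 8 8 7 6 5 4 3 2 1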
library(tidyverse)
a <- c(5,6,5,7,8,9,10,1,4,6,7)
b <- c(4,6,5,8,5,3,4,5,2,1,3)
df <- data.frame(a, b)
df |>
  mutate(c = ifelse(a > b, 2, -2),          # determines c for the a != b cases
         grp = rev(cumsum(rev(a != b)))) |> # group variable; rev() because we want a backward cumsum
  group_by(grp) |>
  mutate(c = last(c)) |>
  ungroup() |>
  select(-grp)
#> # A tibble: 11 × 3
#>        a     b     c
#>    <dbl> <dbl> <dbl>
#>  1     5     4     2
#>  2     6     6    -2
#>  3     5     5    -2
#>  4     7     8    -2
#>  5     8     5     2
#>  6     9     3     2
#>  7    10     4     2
#>  8     1     5    -2
#>  9     4     2     2
#> 10     6     1     2
#> 11     7     3     2
Created on 2022-03-30 by the reprex package (v2.0.1)


Apply filter to the table function

I'm looking for a way to execute a simple task faster than I am currently able to.
I want to use the table function in R on part of a dataframe. Of course it would be possible to first use subset and then table, but this is a bit tedious. (In my case, during a first inspection of the data, I want to check the frequency of NAs on individual variables in a multi-national survey for each of the 25 participating countries. So I'd need to create 25 subsets, make the table, and then remove the subsets again because I don't need them anymore.)
Here is some example data:
a <- c(1,1,1,1,1,2,2,2,2,2)
b <- c(1,3,99,99,2,3,2,99,1,1)
df <- cbind.data.frame(a,b)
And this is the workaround solution.
df1 <- subset(df, a == 1)
table(df1$b)
df2 <- subset(df, a == 2)
table(df2$b)
rm(df1, df2)
Is there a simpler way?
Also, I feel like I am spamming with ultra-basic questions like these. If anyone has a suggestion on how I could have found the answer directly I'd be happy to hear it. Other than trying some code myself, I googled terms like 'r apply filter to table', 'r filter table function', 'r table subset dataframe', etc.
Assuming 99 codes for your NAs, there is a way using the purrr package, which I find excellent for seeing how many NAs there are in each column:
library(purrr)
df |>
  map_df(~ sum(. == 99))
      a     b
  <int> <int>
1     0     3
Can you provide an example of the structure of the original data (the multi-national survey)?
You would probably be able to answer your question with much tidier code using the dplyr package, with functions such as:
survey_data %>%
  select(column1, column2, country, etc) %>% # choose your desired columns
  group_by(country) %>%
  summarise_all(~ sum(is.na(.)))             # funs() is deprecated in newer dplyr; use a lambda
You could split on your a variable and use lapply to call table on each list element, like this:
lapply(split(df, df$a), \(x) table(x))
#> $`1`
#>    b
#> a   1 2 3 99
#>   1 1 1 1  2
#> 
#> $`2`
#>    b
#> a   1 2 3 99
#>   2 2 1 1  1
Created on 2023-02-18 with reprex v2.0.2
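If you only care about the frequencies of b (an assumption on my part; it drops the a row label from each table), a slightly leaner variant splits just the b column:
lapply(split(df$b, df$a), table)
#> $`1`
#> 
#>  1  2  3 99 
#>  1  1  1  2 
#> 
#> $`2`
#> 
#>  1  2  3 99 
#>  2  1  1  1 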
Just use it in an lapply.
alv <- unique(df$a)
lapply(alv, \(x) table(subset(df, a == x, b))) |> setNames(alv)
# $`1`
# b
#  1  2  3 99 
#  1  1  1  2 
# 
# $`2`
# b
#  1  2  3 99 
#  2  1  1  1 
However, it might be better to code 99 (and probably others) as NA,
df[] <- lapply(df, \(x) replace(x, x %in% c(99), NA))
and count the NAs in b for each individual a.
with(df, tapply(b, a, \(x) sum(is.na(x))))
# 1 2
# 2 1
Just use table() on the whole data frame and pull out the parts you want afterwards. Note that you index into the two-way table using character values. For example,
a <- c(1,1,1,1,1,2,2,2,2,2)
b <- c(1,3,99,99,2,3,2,99,1,1)
df <- cbind.data.frame(a,b)
full <- table(df$a, df$b)
full["1",] # corresponds to subset a == 1
#> 1 2 3 99
#> 1 1 1 2
full["2",] # corresponds to subset a == 2
#> 1 2 3 99
#> 2 1 1 1
full[, "99"] # corresponds to subset b == 99
#> 1 2
#> 2 1
Created on 2023-02-18 with reprex v2.0.2

Count repeated characters in a string

How do I find the length of the longest repeated character run in a string?
col1        repeated letter   repeated number
apples333                 2                 3
summer13                  2                 0
talk77                    0                 2
Aa6668                    2                 3
I can use lengths(regmatches(str, gregexpr("a", str))) or str_count(str, "a"), but the idea is to automatically find the most-repeated character/number and return its count.
Using rle and rawConversion functions:
d <- data.frame(col1 = c("apples333", "summer13", "talk77", "Aa6668"))

foo <- function(x, p){
  r <- rle(charToRaw(tolower(x)))
  res <- max(r$lengths[grepl(p, rawToChar(r$values, multiple = TRUE))])
  if(res == 1) res <- 0
  res
}

d$repLetter <- sapply(d$col1, foo, p = "[a-z]")
d$repNumber <- sapply(d$col1, foo, p = "[0-9]")
d
#        col1 repLetter repNumber
# 1 apples333         2         3
# 2  summer13         2         0
# 3    talk77         0         2
# 4    Aa6668         2         3
There is probably an elegant regex-based solution for this (I am obviously not a big regex user). The following is based on determining the run lengths in a vector with the base rle() function, i.e. counting how often elements repeat.
As a strategy, we develop a function that works on a single input string and returns the different runs with their associated counts. Then, to operate over several input strings, we apply (loop) the function to each element of the input vector.
single loop
Let's see how rle() works:
x <- "abba" # a test string - who does not know ABBA
x_split <- strsplit(x, "") %>% unlist # split the string, unlist to coerce vector
x_rle <- rle(x_split) # apply rle()
# now let's check what we have
x_rle
Run Length Encoding
lengths: int [1:3] 1 2 1
values : chr [1:3] "a" "b" "a"
rle() returns a list. As you want to filter, etc. on your results, it might be easier to turn this into a data frame (see the small sketch below). We will also store the actual input.
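In base R that conversion is a one-liner; a small sketch using the x_rle object from above:
with(x_rle, data.frame(values, lengths))
#>   values lengths
#> 1      a       1
#> 2      b       2
#> 3      a       1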
With a view to applying this to other strings (i.e. looping over an input vector), we wrap this into a function:
library(dplyr)

check_rle_char_num <- function(x){
  # split the string and compute run lengths
  x_split <- strsplit(x, "") %>% unlist()
  x_rle <- rle(x_split)
  # turn it into a tibble
  with(x_rle, tibble(values, lengths)) %>%
    # ----------- store the input string and check for chars/numerics
    mutate(input  = x,
           is_num = grepl(pattern = "[0-9]", values)  # logical check for numbers
    ) %>%
    # ----------- order output tibble
    select(input, everything())
}
check that it works:
> ( check_rle_char_num("Appllles44777") )
# A tibble: 7 x 4
  input         values lengths is_num
  <chr>         <chr>    <int> <lgl> 
1 Appllles44777 A            1 FALSE
2 Appllles44777 p            2 FALSE
3 Appllles44777 l            3 FALSE
4 Appllles44777 e            1 FALSE
5 Appllles44777 s            1 FALSE
6 Appllles44777 4            2 TRUE 
7 Appllles44777 7            3 TRUE 
We have all the pieces on which you can filter, select, etc. your desired output.
loop over multiple input strings
We use tidyverse's {purrr} package for this.
# multiple input strings
my_strings <- c("apples333", "summer13", "talk77", "Aa6668", "Appllles44777")

# loop over my_strings
library(purrr)
test <- my_strings %>%
  map_dfr(.f = ~ check_rle_char_num(.x)) # map_dfr returns a data frame
test
# A tibble: 29 x 4
   input     values lengths is_num
   <chr>     <chr>    <int> <lgl> 
 1 apples333 a            1 FALSE
 2 apples333 p            2 FALSE
 3 apples333 l            1 FALSE
 4 apples333 e            1 FALSE
 5 apples333 s            1 FALSE
 6 apples333 3            3 TRUE 
 7 summer13  s            1 FALSE
 8 summer13  u            1 FALSE
 9 summer13  m            2 FALSE
10 summer13  e            1 FALSE
final push, filter, and reshape a nice output tibble
# per problem statement - filter for the maximum and at least 2 counts (i.e. > 1)
result <- test %>%
  group_by(input, is_num) %>%
  filter(lengths == max(lengths), lengths > 1)
> result
# A tibble: 7 x 4
# Groups:   input, is_num [7]
  input         values lengths is_num
  <chr>         <chr>    <int> <lgl> 
1 apples333     p            2 FALSE
2 apples333     3            3 TRUE 
3 summer13      m            2 FALSE
4 talk77        7            2 TRUE 
5 Aa6668        6            3 TRUE 
6 Appllles44777 l            3 FALSE
7 Appllles44777 7            3 TRUE 
To emulate the results listed in the problem statement, one can reshuffle the columns and provide nicer column names:
library(tidyr) # for reshaping

result %>%
  # ---------- spread the tibble; the new column names combine the previous
  # ---------- colnames with TRUE/FALSE - mind that TRUE marks the numbers
  tidyr::pivot_wider(names_from  = is_num,
                     values_from = c(values, lengths)) %>%
  rename(char       = values_FALSE,
         char_count = lengths_FALSE,
         nums       = values_TRUE,
         nums_count = lengths_TRUE) %>%
  # ---------- change the column order for a nicer output
  select(input, starts_with("char"), starts_with("num"))
# A tibble: 5 x 5
# Groups:   input [5]
  input         char  char_count nums  nums_count
  <chr>         <chr>      <int> <chr>      <int>
1 apples333     p              2 3              3
2 summer13      m              2 NA            NA
3 talk77        NA            NA 7              2
4 Aa6668        NA            NA 6              3
5 Appllles44777 l              3 7              3
final notes
The solution presented
does the filtering on the result data frame (after the loop). If there are no other operations on your data, you can lift this into the function.
does not clean the NAs in the final output. If you need zeros for no letter or no number, you can replace the NAs.
keeps characters and numbers in a single data frame. Obviously, you can split them and combine both again with a join() or bind_cols() on the input variable; this saves the pivot_wider step.
does not care for "ties", i.e. a string with multiple character and/or number runs of the same count. You may have to handle this.
Last but not least: simplify the code, if none of the columns/variables kept in the tibble help for your problem.
Solution
You can go with this:
library(stringr)

max_freq <- Vectorize(function(x) max(tabulate(factor(x))))

df$repeated_letter <- max_freq(str_extract_all(str_to_lower(df$col1), "[:alpha:]"))
df$repeated_number <- max_freq(str_extract_all(str_to_lower(df$col1), "[:digit:]"))
df
#>            col1 repeated_letter repeated_number
#> 1     apples333               2               3
#> 2      summer13               2               1
#> 3        talk77               1               2
#> 4        Aa6668               2               3
#> 5 Appllles44777               3               3
Explanation
Following is a step-by-step breakdown of the solution with some explanations:
# take your column
df$col1 |>
  # set to lower case so A and a count as the same character
  str_to_lower() |>
  # extract only letters (or digits) as a list of vectors
  str_extract_all("[:alpha:]") |>
  # get a frequency table for each vector
  lapply(factor) |>
  lapply(tabulate) |>
  # extract the count of the most repeated letter from each table, returning a vector
  sapply(max)
#> [1] 2 2 1 2 3
Data
Where df is:
df <- data.frame(col1 = c("apples333", "summer13", "talk77", "Aa6668", "Appllles44777"))
Warnings
When there are no repeated characters, 1 will be returned, which is actually the more consistent answer, since the most repeated character occurs once. If you prefer zero, you can replace all ones with zeros.
When a string contains no letters (or no digits) at all, -Inf will be returned, because max() of an empty vector is -Inf. If you want a different result (like zero) you can replace it, as sketched below; your example did not include such a case.
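If you do want zeros in both situations, a minimal sketch of the replacements (using the column names from the output above):
# map "only single occurrences" (1) and "no such characters" (-Inf) to 0
df$repeated_letter[!is.finite(df$repeated_letter) | df$repeated_letter == 1] <- 0
df$repeated_number[!is.finite(df$repeated_number) | df$repeated_number == 1] <- 0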
Though late to the party, this method might still be of interest:
library(tidyr)
library(stringr)
library(dplyr)
d %>%
  # count the lengths of character repetitions:
  mutate(
    # for letters (case-insensitive, thanks to (?i)):
    dup_w = lapply(str_extract_all(col1, "(?i)([a-z])\\1+"), nchar),
    # for numbers:
    dup_n = lapply(str_extract_all(col1, "([0-9])\\1+"), nchar)) %>%
  # throw all repetition counts into a single column:
  pivot_longer(c(dup_w, dup_n)) %>%
  # unnest the list column:
  unnest(cols = value) %>%
  # group:
  group_by(col1, name) %>%
  # reduce the dataframe to the maximum value per group:
  filter(value == max(value)) %>%
  # widen the dataframe back to the original format:
  pivot_wider(names_from = name, values_from = value)
# A tibble: 5 x 3
# Groups:   col1 [5]
  col1            dup_w dup_n
  <chr>           <int> <int>
1 11applesssss333     5     3
2 summer13            2    NA
3 talk77             NA     2
4 Aa6668              2     3
5 Appllles44777       3     3
Data (with lots more repetitions to make things clearer):
d <- data.frame(col1 = c("11applesssss333", "summer13", "talk77",
"Aa6668", "Appllles44777"))

R dplyr::mutate with ifelse conditioned on a global variable recycles result from first row

I am curious why an ifelse() statement within a call to dplyr::mutate() only seems to apply to the first row of my data frame. This returns a single value, which is recycled down the entire column. Since the expressions evaluated in either case of the ifelse() are only valid in the context of my data frame, I would expect the condition check and resulting expression evaluations to be performed on the columns as a whole, not just their first elements.
Here's an example: I have a variable called checkVar defined outside the data frame. Depending on the value of checkVar, I want to add different values to my data frame in a new column, z, computed as a function of existing columns.
If I do
checkVar <- 1
df <- data.frame( x=11:15, y=1:5 ) %>%
  dplyr::mutate( z=ifelse(checkVar == 1, x/y, x-y) )
df
it returns
   x y  z
1 11 1 11
2 12 2 11
3 13 3 11
4 14 4 11
5 15 5 11
Instead of z being the quotient of x and y for each row, all rows are populated with the quotient of x and y from the first row of the data frame.
However, if I specify rowwise(), I get the result I want:
df <- df %>%
  dplyr::rowwise() %>%
  dplyr::mutate( z=ifelse(checkVar == 1, x/y, x-y) ) %>%
  dplyr::ungroup()
df
returns
# A tibble: 5 x 3
      x     y         z
  <int> <int>     <dbl>
1    11     1 11.000000
2    12     2  6.000000
3    13     3  4.333333
4    14     4  3.500000
5    15     5  3.000000
Why do I have to explicitly specify rowwise() when x and y are only defined as columns of my data frame?
This is not really related to dplyr::mutate but to how ifelse works; here are the docs from ?ifelse:
ifelse returns a value with the same shape as test which is filled
with elements selected from either yes or no depending on whether the
element of test is TRUE or FALSE.
Usage
ifelse(test, yes, no)
And example:
ifelse(T, c(1,2,3), c(2,3,4))
# [1] 1
Your first case is vectorized: ifelse takes the vectors x/y and x-y as its yes and no parameters. Since checkVar == 1 returns a scalar TRUE, ifelse returns (x/y)[1], i.e. the first element of the vector x/y, which is 11 and gets recycled to fill the new column z.
In your second case, mutate and ifelse are executed per row, so the expression is evaluated five times, and each time returns the value of x/y for that row.
If your condition is scalar, then you don't need the vectorized ifelse; a plain if/else is more suitable:
checkVar <- 1
mutate(df, z = if(checkVar == 1) x/y else x-y)
#    x y         z
# 1 11 1 11.000000
# 2 12 2  6.000000
# 3 13 3  4.333333
# 4 14 4  3.500000
# 5 15 5  3.000000

Find whether a number sequence stays within ONE adjacent number of the previous value, by group

Let T = {t | t = 1, 2, 3, ..., T} be the time (the sequence order number). For each group, at each time t, we need to make sure the sequence value (a number, let's call it X) lies within the set {K-1, K, K+1}, where K is the previous sequence value at t-1. For example, if the previous value is K = 4, then the next value X meets the requirement if it falls within {3, 4, 5}. If every X in the sequence meets the requirement, the group meets the requirement and is labeled TRUE.
I know a for loop can do the trick, but I have a large number of observations and a loop is very slow. I know cummax can quickly find a non-decreasing sequence; I was wondering whether there is a similarly quick vectorized solution.
seq <- c(1,2,1,2,3,1,2,3,1,2,1,2,2,3,4)
group <- rep(letters[1:3],each=5)
dt <- data.frame(group,seq)
> dt
   group seq
1      a   1
2      a   2
3      a   1
4      a   2
5      a   3
6      b   1
7      b   2
8      b   3
9      b   1
10     b   2
11     c   1
12     c   2
13     c   2
14     c   3
15     c   4
The desired output:
group label
a     TRUE
b     FALSE
c     TRUE
You can use the diff function to check whether adjacent elements of the sequence satisfy the condition:
library(dplyr)
dt %>% group_by(group) %>% summarize(label = all(abs(diff(seq)) <= 1))
# A tibble: 3 x 2
#    group label
#   <fctr> <lgl>
# 1      a  TRUE
# 2      b FALSE
# 3      c  TRUE
Here is the corresponding data.table version:
library(data.table)
setDT(dt)[, .(label = all(abs(diff(seq)) <= 1)), .(group)]
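To see why group b fails, a quick look at what diff() returns for its seq values (rows 6-10 of dt): the third difference has absolute value 2, so all(abs(diff(seq)) <= 1) is FALSE.
diff(c(1, 2, 3, 1, 2)) # group b's seq values
# [1]  1  1 -2  1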
You can do:
is.sequence <- function(x)
  all(apply(head(cbind(x - 1, x, x + 1), -1) - x[-1] == 0, 1, any))
tapply(dt$seq, dt$group, is.sequence)
#     a     b     c 
#  TRUE FALSE  TRUE 
Here is a base R example with aggregate and diff
aggregate(c(1, abs(diff(dt$seq)) * (tail(dt$group, -1) ==
                                    head(dt$group, -1))),
          dt["group"], function(i) max(i) < 2)
  group     x
1     a  TRUE
2     b FALSE
3     c  TRUE
The first argument to aggregate is a vector that uses diff and switches the result on and off (to zero) depending on whether adjacent vector elements belong to the same group.
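As an inspection aid (not part of the answer itself), this is what that first argument looks like for the example data, split by group: the boundary differences are zeroed out, and the within-group jump of 2 in group b is what makes it fail.
same_grp <- tail(dt$group, -1) == head(dt$group, -1)
x <- c(1, abs(diff(dt$seq)) * same_grp)
split(x, dt$group)
# $a
# [1] 1 1 1 1 1
# 
# $b
# [1] 0 1 1 2 1
# 
# $c
# [1] 0 1 0 1 1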
We can also use aggregate from base R
aggregate(seq ~ group, dt, FUN = function(x)
  all(c(TRUE, abs(x[-1] - x[-length(x)]) <= 1)))
#   group   seq
# 1     a  TRUE
# 2     b FALSE
# 3     c  TRUE

Remove columns from dataframe where some of values are NA

I have a data frame where some of the values are NA. I would like to remove the columns containing them.
My data frame looks like this:
  v1 v2
1  1 NA
2  1  1
3  2  2
4  1  1
5  2  2
6  1 NA
I tried to compute the column means and select the columns whose means are not NA. I tried this statement, but it does not work:
data=subset(Itun, select=c(is.na(colMeans(Itun))))
I got an error:
error: 'x' must be an array of at least two dimensions
Can anyone give me some help?
The data:
Itun <- data.frame(v1 = c(1,1,2,1,2,1), v2 = c(NA, 1, 2, 1, 2, NA))
This will remove all columns containing at least one NA:
Itun[ , colSums(is.na(Itun)) == 0]
An alternative way is to use apply:
Itun[ , apply(Itun, 2, function(x) !any(is.na(x)))]
Here's a convenient way to do it using the dplyr function select_if(). Combine not (!), any() and is.na(), which is equivalent to selecting all columns that don't contain any NA values.
library(dplyr)
Itun %>%
  select_if(~ !any(is.na(.)))
Alternatively, select(where(~FUNCTION)) can be used:
library(dplyr)
(df <- data.frame(x = letters[1:5], y = NA, z = c(1:4, NA)))
#>   x  y  z
#> 1 a NA  1
#> 2 b NA  2
#> 3 c NA  3
#> 4 d NA  4
#> 5 e NA NA

# Remove columns where all values are NA
df %>%
  select(where(~ !all(is.na(.))))
#>   x  z
#> 1 a  1
#> 2 b  2
#> 3 c  3
#> 4 d  4
#> 5 e NA

# Remove columns with at least one NA
df %>%
  select(where(~ !any(is.na(.))))
#>   x
#> 1 a
#> 2 b
#> 3 c
#> 4 d
#> 5 e
You can use transpose twice:
newdf <- t(na.omit(t(df)))
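Note that t() coerces a data frame to a matrix, so every column is converted to a common type and the result is a matrix rather than a data frame. A small sketch to convert back, reasonable mainly when all columns are numeric:
newdf <- as.data.frame(t(na.omit(t(df))))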
data[,!apply(is.na(data), 2, any)]
A base R method related to the apply answers is
Itun[!unlist(vapply(Itun, anyNA, logical(1)))]
  v1
1  1
2  1
3  2
4  1
5  2
6  1
Here, vapply is used because we are operating on a list and, unlike apply, it does not coerce the object into a matrix. Also, since we know that the output will be a logical vector of length 1 per column, we can declare that to vapply and potentially get a little speed boost. For the same reason, I used anyNA instead of any(is.na()).
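For reference, this is what the inner vapply call evaluates to on the example data:
vapply(Itun, anyNA, logical(1))
#    v1    v2 
# FALSE  TRUE 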
Another alternative is to make use of the base R Filter function:
Filter(function(x) !any(is.na(x)), Itun)
With data.table it would be a little more cumbersome:
setDT(Itun)[, .SD, .SDcols = setdiff(1:ncol(Itun),
                                     which(colSums(is.na(Itun)) > 0))]
You can also try the following; note that it only drops columns in which all values are NA:
df <- df[, colSums(is.na(df)) < nrow(df)]
