R Vlookup Two Criteria and Fill in the Value

The actual data frame has more than a million rows.
For example, suppose the data frame is:
UniqueID Code Value OtherData
1 A 5 Z01
1 B 6 Z02
1 C 7 Z03
2 A 10 Z11
2 B 11 Z24
2 C 12 Z23
3 A 10 Z21
I want to obtain the ratio A/B for each UniqueID. For example, for UniqueID 1 the ratio is A/B = 5/6.
So I transformed the original data frame to:
UniqueID A_Value B_Value Ratio_A/B
1 5
2 10
3 10
The question is: how do I look up the original data frame by UniqueID and fill in its B value? If there is no B value, it should just return 0.
Thank you.

You can first drop the columns you don't need, keep only the rows where Code is "A" or "B", reshape the data to wide format, and then create a new column with the value of A/B. Note that a missing B filled with 0 gives a ratio of Inf, as for UniqueID 3 in the output below.
library(dplyr)
library(tidyr)
df %>%
  select(-OtherData) %>%
  filter(Code %in% c("A", "B")) %>%
  pivot_wider(names_from = Code, values_from = Value, values_fill = list(Value = 0)) %>%
  # Or, to get NA instead of 0 for a missing code, use:
  # pivot_wider(names_from = Code, values_from = Value) %>%
  mutate(Ratio_A_B = A / B)
# UniqueID A B Ratio_A_B
# <int> <int> <int> <dbl>
#1 1 5 6 0.833
#2 2 10 11 0.909
#3 3 10 0 Inf
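The "lookup" phrasing in the question can also be read as a join. The following is a minimal sketch under that reading, assuming the sample data from the question; the names a_vals, b_vals and result are illustrative, and returning NA for the ratio when B is 0 is an assumption rather than something the question specifies.

```r
library(dplyr)

# Sample data from the question
df <- data.frame(
  UniqueID  = c(1, 1, 1, 2, 2, 2, 3),
  Code      = c("A", "B", "C", "A", "B", "C", "A"),
  Value     = c(5, 6, 7, 10, 11, 12, 10),
  OtherData = c("Z01", "Z02", "Z03", "Z11", "Z24", "Z23", "Z21")
)

# One lookup table per code, keyed by UniqueID
a_vals <- df %>% filter(Code == "A") %>% select(UniqueID, A_Value = Value)
b_vals <- df %>% filter(Code == "B") %>% select(UniqueID, B_Value = Value)

result <- a_vals %>%
  left_join(b_vals, by = "UniqueID") %>%       # the "vlookup" step
  mutate(
    B_Value   = coalesce(B_Value, 0),          # no B row: fill in 0
    # Assumption: report NA instead of Inf when B is 0
    Ratio_A_B = if_else(B_Value == 0, NA_real_, A_Value / B_Value)
  )
```

Because the join keeps every row of a_vals, UniqueID 3 stays in the result with B_Value 0 even though it has no B row in the original data.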

Related

R - Summarize dataframe to avoid NAs

Having a dataframe like:
id = c(1,1,1)
A = c(3,NA,NA)
B = c(NA,5,NA)
C= c(NA,NA,2)
df = data.frame(id,A,B,C)
id A B C
1 1 3 NA NA
2 1 NA 5 NA
3 1 NA NA 2
I want to summarize the whole dataframe into one row that contains no NAs. It should look like:
id A B C
1 1 3 5 2
It should also work when the dataframe is bigger and contains more ids, following the same logic.
I didn't find the right function for that and tried some variations of summarise().
You can group_by id and use max with na.rm = TRUE:
library(dplyr)
df %>%
  group_by(id) %>%
  # A lambda avoids passing na.rm through across(), which newer dplyr versions deprecate
  summarise(across(everything(), ~ max(.x, na.rm = TRUE)))
id A B C
1 1 3 5 2
If a group has several non-missing values in the same column, max may not be what you want; you can use sum instead.
Using fmax from collapse
library(collapse)
fmax(df[-1], df$id)
A B C
1 3 5 2
Alternatively, fill the missing values within each id in both directions and keep the first row of each group:
library(dplyr)
library(tidyr)
data.frame(id,A,B,C) %>% group_by(id) %>% fill(c(A,B,C), .direction = 'downup') %>%
  slice_head(n=1)
Created on 2023-02-03 with reprex v2.0.2
# A tibble: 1 × 4
# Groups: id [1]
id A B C
<dbl> <dbl> <dbl> <dbl>
1 1 3 5 2
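One caveat with the max approach: when every value of a column is NA within a group, max with na.rm = TRUE returns -Inf and emits a warning. A minimal sketch that takes the first non-missing value instead is shown below; the second id and the name out are illustrative additions, not part of the question.

```r
library(dplyr)

# Question's data extended with a second id whose C column is entirely NA
df <- data.frame(
  id = c(1, 1, 1, 2, 2),
  A  = c(3, NA, NA, NA, 7),
  B  = c(NA, 5, NA, 8, NA),
  C  = c(NA, NA, 2, NA, NA)
)

out <- df %>%
  group_by(id) %>%
  # First non-missing value per column; indexing an empty vector with [1] gives NA
  summarise(across(everything(), function(x) x[!is.na(x)][1]))
```

Here id 2 keeps NA in column C rather than -Inf, which is usually easier to handle downstream.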

Use pivot_wider on table but keep count of rows

So my problem is as follows, I have a small data frame like this:
test_df <- data.frame(id=c(1,1,2,2,2), ttype=c("D", "C", "D", "D", "C"), val=c(1, 5, 10, 5, 100))
test_df
id ttype val
1 1 A 1
2 1 B 5
3 2 A 10
4 2 A 5
5 2 B 100
Now I want to make it wider to end up like this:
id A B n
1 1 5 1 2
2 2 100 15 3
So I want to replace the ttype with a column for each value, grouped by id with the summed values of val. But my problem is that I still want to keep track of how many either A or B occurred in total for each id, which is n in this case.
I found a way to do this that works, but it is very ugly:
test_df %>%
group_by(id, ttype) %>%
summarise(val = sum(val), n=n()) %>%
pivot_wider(names_from = ttype, values_from=c(val, n), values_fill=0) %>%
mutate(n=n_A+n_B) %>%
select(-n_A, -n_B)
results in:
# A tibble: 2 x 4
# Groups: id [2]
id val_A val_B n
<dbl> <dbl> <dbl> <int>
1 1 5 1 2
2 2 100 15 3
So here the counts of A and B are included separately, after which I sum them and remove both columns. But this means I have to hardcode column names, which is not really doable when there are more than 2 values in ttype.
I feel like there must be a simple way to do this, but I can't figure it out.
You can add the number of rows per id as a new column with add_count, then reshape to wide format with pivot_wider, summing the val values for each ttype.
library(dplyr)
library(tidyr)
test_df %>%
  add_count(id) %>%
  pivot_wider(names_from = ttype, values_from = val, values_fn = sum)
# id n D C
# <dbl> <int> <dbl> <dbl>
#1 1 2 1 5
#2 2 3 15 100
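The same pipeline needs no hardcoded column names, so it should carry over to more than two ttype values. A minimal sketch, assuming a third type "E" and a third id that are not in the original data, with values_fill = 0 added so that absent combinations become 0 instead of NA:

```r
library(dplyr)
library(tidyr)

# Original data plus one hypothetical row with a third ttype
test_df <- data.frame(
  id    = c(1, 1, 2, 2, 2, 3),
  ttype = c("D", "C", "D", "D", "C", "E"),
  val   = c(1, 5, 10, 5, 100, 7)
)

wide <- test_df %>%
  add_count(id) %>%                      # n = number of rows per id, computed before reshaping
  pivot_wider(
    names_from  = ttype,
    values_from = val,
    values_fn   = sum,                   # sum duplicates within an id/ttype cell
    values_fill = 0                      # id/ttype combinations that never occur
  )
```

Counting before reshaping is the design choice that matters: n is attached to every row of an id, so it survives as an identifying column when val is spread out.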

Get rows with same values and creates different columns in R

I have a df with a repeated sequence in the first column. I want to collect the values that share the same number in column 1 and put them in separate columns.
Note: my df has 25502100 rows and the sequence consists of 845 values.
See one simple example of my df below:
df <- data.frame(x = c(1,2,3,4,1,2,3,4), y = c(0.1,-2,-3,1,0,10,6,9))
I would like a function to transform this df into:
df_new
x y z
1 1 0.1 0
2 2 -2.0 10
3 3 -3.0 6
4 4 1.0 9
Does anyone have a solution?
An option with pivot_wider, using rowid from data.table to number the repeats of each x:
library(tidyr)
library(data.table)
library(dplyr)
df %>%
  mutate(rn = c('y', 'z')[rowid(x)]) %>%
  pivot_wider(names_from = rn, values_from = y)
Output:
# A tibble: 4 x 3
# x y z
# <dbl> <dbl> <dbl>
#1 1 0.1 0
#2 2 -2 10
#3 3 -3 6
#4 4 1 9
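The answer above names the new columns from a fixed vector c('y', 'z'), which covers exactly two repeats. A minimal sketch that generates the names from the repeat number instead, using only dplyr and tidyr; the rep_ prefix and the name df_new are illustrative choices:

```r
library(dplyr)
library(tidyr)

df <- data.frame(x = c(1, 2, 3, 4, 1, 2, 3, 4),
                 y = c(0.1, -2, -3, 1, 0, 10, 6, 9))

df_new <- df %>%
  group_by(x) %>%
  mutate(rep = paste0("rep_", row_number())) %>%  # 1st, 2nd, ... occurrence of each x
  ungroup() %>%
  pivot_wider(names_from = rep, values_from = y)
```

With the sample data this yields the columns x, rep_1 and rep_2; a sequence repeated k times would produce rep_1 through rep_k without further changes.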

How to Pass column name in group by from a variable

I want to extract the maximum value of a column for each group of a data frame.
The column name is stored in a variable, which I want to pass to group_by, but it fails.
I have the data frame below:
df <- read.table(header = TRUE, text = 'Gene Value
A 12
A 10
B 3
B 5
B 6
C 1
D 3
D 4')
The column names are stored in the variables below:
columnselected <- c("Value")
groupbycol <- c("Gene")
My code is:
df %>% group_by(groupbycol) %>% top_n(1, columnselected)
This code gives an error. The expected output is:
Gene Value
A 12
B 6
C 1
D 4
You need to convert the column names to symbols using sym and then unquote them using !!
library(dplyr)
df %>% group_by(!!sym(groupbycol)) %>% top_n(1, !!sym(columnselected))
# Gene Value
# <fct> <int>
#1 A 12
#2 B 6
#3 C 1
#4 D 4
We can use group_by_at, which accepts the column name as a string, together with as.name from base R in place of sym
library(dplyr)
df %>%
group_by_at(groupbycol) %>%
top_n(1, !! as.name(columnselected))
# A tibble: 4 x 2
# Groups: Gene [4]
# Gene Value
# <fct> <int>
#1 A 12
#2 B 6
#3 C 1
#4 D 4
NOTE: There would be many dupes for this post :=)
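In current dplyr, group_by_at and top_n are superseded. A minimal sketch of the same idea with across(all_of()) for the grouping column and the .data pronoun for the ordering column, assuming the data and variables from the question; top_rows is an illustrative name:

```r
library(dplyr)

df <- read.table(header = TRUE, text = "Gene Value
A 12
A 10
B 3
B 5
B 6
C 1
D 3
D 4")

columnselected <- "Value"
groupbycol     <- "Gene"

top_rows <- df %>%
  group_by(across(all_of(groupbycol))) %>%       # group by the column named in the string
  slice_max(.data[[columnselected]], n = 1) %>%  # keep the row(s) with the largest value
  ungroup()
```

Like top_n, slice_max keeps ties by default, so a group with two equal maxima would return both rows.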

For each group find observations with max value of several columns

Assume I have a data frame like so:
set.seed(4)
df<-data.frame(
group = rep(1:10, each=3),
id = rep(sample(1:3), 10),
x = sample(c(rep(0, 15), runif(15))),
y = sample(c(rep(0, 15), runif(15))),
z = sample(c(rep(0, 15), runif(15)))
)
As seen above, some elements of x, y, z vectors take value of zero, the rest being drawn from the uniform distribution between 0 and 1.
For each group, determined by the first column, I want to find three IDs from the second column, each pointing to the row with the highest value of x, y and z in that group. Assume there are no ties, except when a variable is 0 in all observations of a given group; in that case I don't want to return any id for the row with the maximum value.
The output would look like so:
group x y z
1 2 2 1
2 2 3 1
... .........
My first thought is to select rows with maximum values separately for each variable and then use merge to put it in one table. However, I'm wondering if it can be done without merge, for example with standard dplyr functions.
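Here is my proposed solution using plyr:
library(plyr)
ddply(df, .variables = "group", .fun = function(t) {
  # For each of x, y, z: NA if the column is all zero, else the id of the maximum
  apply(X = t[, c(-1, -2)], MARGIN = 2,
        function(z) ifelse(sum(abs(z)) == 0, yes = NA, no = t$id[which.max(z)]))
})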
# group x y z
#1 1 2 2 1
#2 2 2 3 1
#3 3 1 3 2
#4 4 3 3 1
#5 5 2 3 NA
#6 6 3 1 3
#7 7 1 1 2
#8 8 NA 2 3
#9 9 2 1 3
#10 10 2 NA 2
Here is a solution using dplyr and tidyr. Notice that if all numbers in a group are the same, we cannot decide which id should be selected, so filter(n_distinct(Value) > 1) is added to remove those records. In the final output df2, NA marks the cases where all numbers are the same. We can decide later whether to impute those NA values. This solution should work for any number of ids or columns (x, y, z, ...).
library(dplyr)
library(tidyr)
df2 <- df %>%
  gather(Column, Value, -group, -id) %>%
  arrange(group, Column, desc(Value)) %>%
  group_by(group, Column) %>%
  # If all values in a group-Column are the same, remove that group-Column
  filter(n_distinct(Value) > 1) %>%
  slice(1) %>%
  select(-Value) %>%
  spread(Column, id)
If you want to stick with just dplyr, you can use the multiple-column summarize/mutate functions. This should work regardless of the form of id; my initial attempt was slightly cleaner but assumed that an id of zero was invalid.
df %>%
  group_by(group) %>%
  mutate_at(vars(-id),
            # If the row is the max within the group, set the value
            # to the id and use NA otherwise
            funs(ifelse(max(.) != 0 & . == max(.),
                        id,
                        NA))) %>%
  select(-id) %>%
  summarize_all(funs(
    # There are zero or one non-NA values per group, so handle both cases
    if(any(!is.na(.)))
      na.omit(.) else NA))
## # A tibble: 10 x 4
## group x y z
## <int> <int> <int> <int>
## 1 1 2 2 1
## 2 2 2 3 1
## 3 3 1 3 2
## 4 4 3 3 1
## 5 5 2 3 NA
## 6 6 3 1 3
## 7 7 1 1 2
## 8 8 NA 2 3
## 9 9 2 1 3
## 10 10 2 NA 2
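The answers above rely on gather/spread, which are superseded in tidyr, and on funs, which is deprecated in dplyr. A minimal sketch of the same logic with pivot_longer, summarise and pivot_wider follows. It assumes values are non-negative, as in the question, and uses a small hand-made data frame (not the question's random data) so the result is easy to check by eye; res is an illustrative name.

```r
library(dplyr)
library(tidyr)

# Small illustrative data: x is all zero in group 2
df <- data.frame(
  group = c(1, 1, 1, 2, 2, 2),
  id    = c(1L, 2L, 3L, 1L, 2L, 3L),
  x     = c(0, 0.4, 0.2, 0, 0, 0),
  y     = c(0.9, 0, 0.1, 0, 0.3, 0.7)
)

res <- df %>%
  pivot_longer(c(x, y), names_to = "Column", values_to = "Value") %>%
  group_by(group, Column) %>%
  # id of the row with the maximum; NA when the variable is 0 throughout the group
  summarise(id = if (max(Value) > 0) id[which.max(Value)] else NA_integer_,
            .groups = "drop") %>%
  pivot_wider(names_from = Column, values_from = id)
```

For this data, group 1 gets id 2 for x and id 1 for y, while group 2 gets NA for x and id 3 for y.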
