Get rows with the same values and create different columns in R - r

I have a df with a repeated sequence in the first column, and I want to collect the values that share the same number in column 1 and spread them into separate columns.
Note: my df has 25502100 rows and the sequence consists of 845 values.
Here is a simple example of my df:
df <- data.frame(x = c(1,2,3,4,1,2,3,4), y = c(0.1,-2,-3,1,0,10,6,9))
I would like a function to transform this df into:
df_new
x y z
1 1 0.1 0
2 2 -2.0 10
3 3 -3.0 6
4 4 1.0 9
Does anyone have a solution?

An option with pivot_wider (rowid comes from data.table)
library(tidyr)
library(data.table)
library(dplyr)
df %>%
mutate(rn = c('y', 'z')[rowid(x)]) %>%
pivot_wider(names_from = rn, values_from = y)
-output
# A tibble: 4 x 3
# x y z
# <dbl> <dbl> <dbl>
#1 1 0.1 0
#2 2 -2 10
#3 3 -3 6
#4 4 1 9
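For a table of this size, a base R reshape may also be worth trying. This is only a minimal sketch, and it assumes the sequence in x repeats completely and always in the same order (the code checks this). The names n_seq, wide and the column names v1, v2 are made up for illustration. If the real data satisfy the assumption, 25502100 rows with 845 values would give 30180 value columns.

```r
df <- data.frame(x = c(1,2,3,4,1,2,3,4), y = c(0.1,-2,-3,1,0,10,6,9))

# Length of one full sequence (4 here, 845 in the real data)
n_seq <- length(unique(df$x))

# Assumption check: x is the same sequence repeated without gaps
stopifnot(nrow(df) %% n_seq == 0)
stopifnot(all(df$x == rep(df$x[seq_len(n_seq)], times = nrow(df) / n_seq)))

# matrix() fills column by column, so each repetition becomes one column
wide <- matrix(df$y, nrow = n_seq)

df_new <- data.frame(x = df$x[seq_len(n_seq)], wide)
names(df_new)[-1] <- paste0("v", seq_len(ncol(wide)))  # hypothetical column names
df_new
```

Keeping the result as the matrix wide instead of converting it to a data frame may be lighter when there are many thousands of columns.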

Related

How to filter rows according to the bigger value in another column?

I have a data frame like below
d1<-c('a','b','c','d','e','f','g','h','i','j','k','l')
d2<-c(1,5,1,2,13,2,32,2,1,2,4,5)
df1<-data.frame(d1,d2)
My goal is to filter the rows so that, within every block of 3 rows, only the row with the biggest value of d2 is kept.
Thank you!
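We may use rollmax from zoo to filter the rows. Note that rollmax uses a sliding (centered) window rather than non-overlapping blocks of 3, so it reproduces the block-wise result for this data but is not guaranteed to do so in general.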
library(dplyr)
library(zoo)
df1 %>%
filter(d2 == na.locf0(rollmax(d2, k = 3, fill = NA)))
d1 d2
1 b 5
2 e 13
3 g 32
4 l 5
You can create a grouping variable that puts observations into groups of 3. I first create a sequence from 1 to the total number of rows in steps of 3, then repeat each number of this sequence 3 times and subset the result to the length of the data, in case the number of observations is not perfectly divisible by 3. Then simply keep the rows that hold the largest d2 value of each group.
library(dplyr)
df1 %>%
mutate(group = rep(seq(1, n(), by = 3), each = 3)[1:n()]) %>%
group_by(group) %>%
filter(d2 == max(d2))
# A tibble: 4 x 3
# Groups: group [4]
# d1 d2 group
# <chr> <dbl> <dbl>
# 1 b 5 1
# 2 e 13 4
# 3 g 32 7
# 4 l 5 10
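The same grouping idea can also be written with integer division on the row number, which needs no subsetting when the row count is not a multiple of 3. A minimal sketch, with the block size 3 hard-coded and res as an illustrative name:

```r
library(dplyr)

df1 <- data.frame(
  d1 = c('a','b','c','d','e','f','g','h','i','j','k','l'),
  d2 = c(1,5,1,2,13,2,32,2,1,2,4,5)
)

res <- df1 %>%
  # rows 1-3 -> group 0, rows 4-6 -> group 1, and so on
  mutate(group = (row_number() - 1) %/% 3) %>%
  group_by(group) %>%
  # keep the row(s) holding the block maximum; ties keep every tied row
  filter(d2 == max(d2)) %>%
  ungroup() %>%
  select(-group)
res
```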
Yet another solution:
library(tidyverse)
d1<-c('a','b','c','d','e','f','g','h','i','j','k','l')
d2<-c(1,5,1,2,13,2,32,2,1,2,4,5)
df1<-data.frame(d1,d2)
df1 %>%
mutate(id = rep(1:(n()/3), each=3)) %>%
group_by(id) %>%
slice_max(d2) %>%
ungroup %>% select(-id)
#> # A tibble: 4 × 2
#> d1 d2
#> <chr> <dbl>
#> 1 b 5
#> 2 e 13
#> 3 g 32
#> 4 l 5

How to subtract two columns using index in tidyverse

I have a data frame
df <- tibble(row1= c(1,2,3,4,5),
row2=c(2,3,4,5,6))
How do I subtract the two columns using their index (not their names)? I would like this to work
df %>% mutate(diff= select(1)-select(2))
But the universe is not on my side....
select needs a data argument, as its usage is
select(.data, ...)
Also, as select returns a data.frame/tibble, we can extract the column as a vector with [[
library(dplyr)
df %>%
mutate(diff = select(., 1)[[1]] - select(., 2)[[1]])
-output
# A tibble: 5 x 3
# row1 row2 diff
# <dbl> <dbl> <dbl>
#1 1 2 -1
#2 2 3 -1
#3 3 4 -1
#4 4 5 -1
#5 5 6 -1
or instead use pull to return the vector
df %>%
mutate(diff = pull(., 1) - pull(., 2))
What about using select like below?
> df %>% mutate(diff = do.call(`-`,select(.,1:2)))
# A tibble: 5 x 3
row1 row2 diff
<dbl> <dbl> <dbl>
1 1 2 -1
2 2 3 -1
3 3 4 -1
4 4 5 -1
5 5 6 -1
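If a pipeline is not required, positional extraction with [[ in base R does the same thing. A minimal sketch using the example data from the question:

```r
library(tibble)

df <- tibble(row1 = c(1,2,3,4,5),
             row2 = c(2,3,4,5,6))

# [[1]] and [[2]] pull the first and second column as plain vectors
df$diff <- df[[1]] - df[[2]]
df
```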

R Vlookup Two Criteria and Fill in the Value

The actual dataframe consists of more than a million rows.
Say for example a dataframe is:
UniqueID Code Value OtherData
1 A 5 Z01
1 B 6 Z02
1 C 7 Z03
2 A 10 Z11
2 B 11 Z24
2 C 12 Z23
3 A 10 Z21
I want to obtain ratio of A/B. For example, for UniqueID 1, its ratio of A/B = 5/6.
Thus, I transform the original dataframe to:
UniqueID A_Value B_Value Ratio_A/B
1 5
2 10
3 10
Question is, how do I lookup the original dataframe by its UniqueID and then fill in its B value? If there is no B value, then just return 0.
Thank you.
You can first remove the columns which are not necessary, select only rows where Code has value "A" or "B", get the data in wide format and create a new column with the value of A/B
library(dplyr)
library(tidyr)
df %>%
select(-OtherData) %>%
filter(Code %in% c("A", "B")) %>%
pivot_wider(names_from = Code, values_from = Value, values_fill = list(Value = 0)) %>%
#OR if you want to have NA values instead of 0 use
#pivot_wider(names_from = Code, values_from = Value) %>%
mutate(Ratio_A_B = A/B)
# UniqueID A B Ratio_A_B
# <int> <int> <int> <dbl>
#1 1 5 6 0.833
#2 2 10 11 0.909
#3 3 10 0 Inf
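The lookup described in the question can also be sketched as a join: take the A rows, look up the matching B rows by UniqueID, and fill a missing B with 0. The names a, b and res are illustrative, and returning NA for the ratio when B is 0 (instead of Inf) is an assumption about what is wanted, not something the question states.

```r
library(dplyr)

df <- data.frame(
  UniqueID  = c(1, 1, 1, 2, 2, 2, 3),
  Code      = c("A", "B", "C", "A", "B", "C", "A"),
  Value     = c(5, 6, 7, 10, 11, 12, 10),
  OtherData = c("Z01", "Z02", "Z03", "Z11", "Z24", "Z23", "Z21")
)

# One small table per code, keyed by UniqueID
a <- df %>% filter(Code == "A") %>% select(UniqueID, A_Value = Value)
b <- df %>% filter(Code == "B") %>% select(UniqueID, B_Value = Value)

res <- a %>%
  left_join(b, by = "UniqueID") %>%            # the "lookup" step
  mutate(
    B_Value   = coalesce(B_Value, 0),          # no B row -> 0, as requested
    # Assumption: report NA rather than Inf when there is no B value
    Ratio_A_B = if_else(B_Value == 0, NA_real_, A_Value / B_Value)
  )
res
```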

Subtracting the last value in a group from previous values in dplyr

I have the following data
data = tribble(~t,~key,~value,
1,"a",10,
2,"a",20,
3,"a",30,
1,"b",100,
2,"b",200,
3,"b",300,
1,"c",1000,
2,"c",2000,
3,"c",3000)
and would like to get the following result
result = tribble(~t,~key,~value,
1,"a",-20,
2,"a",-10,
3,"a",0,
1,"b",-200,
2,"b",-100,
3,"b",0,
1,"c",-2000,
2,"c",-1000,
3,"c",0)
The idea is that I would like to subtract the 3rd value from all of the other values in that group. I tried to group_by the key, but struggled on the row wise subtraction within the group
We can use the last function from dplyr. The arrange call makes sure the data are in the right order before the last value of each group is taken.
library(dplyr)
data2 <- data %>%
arrange(key, t) %>%
group_by(key) %>%
mutate(value = value - last(value)) %>%
ungroup()
data2
# # A tibble: 9 x 3
# t key value
# <dbl> <chr> <dbl>
# 1 1 a -20
# 2 2 a -10
# 3 3 a 0
# 4 1 b -200
# 5 2 b -100
# 6 3 b 0
# 7 1 c -2000
# 8 2 c -1000
# 9 3 c 0
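If sorting first is undesirable, the reference value can be picked by the largest t in each group instead of by position. A minimal sketch, assuming "the 3rd value" means the value at the highest t within each key; shuffled and res are illustrative names, and the rows are reversed only to show that the result does not depend on row order:

```r
library(dplyr)

data <- data.frame(
  t     = rep(1:3, times = 3),
  key   = rep(c("a", "b", "c"), each = 3),
  value = c(10, 20, 30, 100, 200, 300, 1000, 2000, 3000)
)

# Reverse the rows on purpose so the last row of a group is no longer t = 3
shuffled <- data[rev(seq_len(nrow(data))), ]

res <- shuffled %>%
  group_by(key) %>%
  # which.max(t) finds the row with the highest t inside the group
  mutate(value = value - value[which.max(t)]) %>%
  ungroup() %>%
  arrange(key, t)
res
```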

For each group find observations with max value of several columns

Assume I have a data frame like so:
set.seed(4)
df<-data.frame(
group = rep(1:10, each=3),
id = rep(sample(1:3), 10),
x = sample(c(rep(0, 15), runif(15))),
y = sample(c(rep(0, 15), runif(15))),
z = sample(c(rep(0, 15), runif(15)))
)
As seen above, some elements of x, y, z vectors take value of zero, the rest being drawn from the uniform distribution between 0 and 1.
For each group, determined by the first column, I want to find three IDs from the second column, pointing to the highest value of x, y, z variables in the group. Assume there are no draws except for the cases in which a variable takes a value of 0 in all observations of a given group - in that case I don't want to return any number as an id of a row with maximum value.
The output would look like so:
group x y z
1 2 2 1
2 2 3 1
... .........
My first thought is to select rows with maximum values separately for each variable and then use merge to put it in one table. However, I'm wondering if it can be done without merge, for example with standard dplyr functions.
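Here is my proposed solution using plyr:
library(plyr)
ddply(df, .variables = c("group"), .fun = function(t) {
  # for each of x, y, z: NA if all values are zero, otherwise the id at the maximum
  apply(X = t[, c(-1, -2)], MARGIN = 2,
        function(z) ifelse(sum(abs(z)) == 0, yes = NA, no = t$id[which.max(z)]))
})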
# group x y z
#1 1 2 2 1
#2 2 2 3 1
#3 3 1 3 2
#4 4 3 3 1
#5 5 2 3 NA
#6 6 3 1 3
#7 7 1 1 2
#8 8 NA 2 3
#9 9 2 1 3
#10 10 2 NA 2
A solution using dplyr and tidyr. Notice that if all numbers are the same, we cannot decide which id should be selected, so filter(n_distinct(Value) > 1) is added to remove those records. In the final output df2, NA marks the group-column combinations where all numbers are the same. We can decide whether to impute those NA values later if we want. This solution should work for any number of ids or columns (x, y, z, ...).
library(dplyr)
library(tidyr)
df2 <- df %>%
gather(Column, Value, -group, -id) %>%
arrange(group, Column, desc(Value)) %>%
group_by(group, Column) %>%
# If all values from a group-Column are all the same, remove that group-Column
filter(n_distinct(Value) > 1) %>%
slice(1) %>%
select(-Value) %>%
spread(Column, id)
If you want to stick with just dplyr, you can use the multiple-column summarize/mutate functions. This should work regardless of the form of id; my initial attempt was slightly cleaner but assumed that an id of zero was invalid.
df %>%
group_by(group) %>%
mutate_at(vars(-id),
# If the row is the max within the group, set the value
# to the id and use NA otherwise
funs(ifelse(max(.) != 0 & . == max(.),
id,
NA))) %>%
select(-id) %>%
summarize_all(funs(
# There are zero or one non-NA values per group, so handle both cases
if(any(!is.na(.)))
na.omit(.) else NA))
## # A tibble: 10 x 4
## group x y z
## <int> <int> <int> <int>
## 1 1 2 2 1
## 2 2 2 3 1
## 3 3 1 3 2
## 4 4 3 3 1
## 5 5 2 3 NA
## 6 6 3 1 3
## 7 7 1 1 2
## 8 8 NA 2 3
## 9 9 2 1 3
## 10 10 2 NA 2
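Since funs() has been deprecated since dplyr 0.8.0 and mutate_at/summarize_all are superseded, the same idea can be sketched with across(), which needs dplyr 1.0.0 or later. The small data frame below is made up so the result can be checked by hand, and the sketch assumes, as in the question, that values are non-negative and that there are no ties apart from the all-zero case.

```r
library(dplyr)

# Hand-made example: in group 2 every x is zero, so x should come back as NA
df <- data.frame(
  group = c(1, 1, 1, 2, 2, 2),
  id    = c(1L, 2L, 3L, 1L, 2L, 3L),
  x     = c(0, 0.5, 0.2, 0, 0, 0),
  y     = c(0.9, 0, 0.1, 0, 0.3, 0.7)
)

res <- df %>%
  group_by(group) %>%
  summarise(across(
    c(x, y),
    # all zeros in the group -> NA, otherwise the id of the row with the maximum
    ~ if (max(.x) == 0) NA_integer_ else id[which.max(.x)]
  ))
res
```

For the data in the question, the column selection would be c(x, y, z).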
