R: combine rows with measurement below certain threshold

I have a "toy" data frame with 2 columns (x and y) and 8 rows. I would like to merge and sum all rows where y < 10. Value of merged x is not very important.
x = c("A","B","C","D","E","F","G","H")
y = c(20,17,16,14,12,9,6,5)
df = data.frame(x,y)
df
x y
1 A 20
2 B 17
3 C 16
4 D 14
5 E 12
6 F 9
7 G 6
8 H 5
Desired output:
x y
1 A 20
2 B 17
3 C 16
4 D 14
5 E 12
6 F 20
F is not necessary and can be set to Other. Thanks in advance!

I think this is what you are looking for.
x = c("A","B","C","D","E","F","G","H")
y = c(20,17,16,14,12,9,6,5)
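keep = which(y >= 10) # rows at or above the threshold stay as they are
df = data.frame(x = x[keep], y = y[keep])
# rows with y < 10 are summed into a single row labelled "Other"
df = rbind(df, data.frame(x = 'Other', y = sum(y[y < 10])))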

We can also use subset/transform/rbind, starting from the original df:
rbind(subset( df, y>=10),
transform(subset(df, y<10), x= x[1], y= sum(y))[1,])
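Both answers hard-code the cutoff of 10. As a minimal sketch, the same idea can be wrapped in a small function. The names lump_below, threshold and label are hypothetical (they do not come from either answer), and the columns are assumed to be called x and y as in the question.

```r
# Hypothetical helper: collapse every row with y below `threshold` into a
# single row labelled `label`; rows at or above the threshold are untouched.
lump_below <- function(data, threshold = 10, label = "Other") {
  small <- data$y < threshold
  if (!any(small)) return(data)  # nothing to merge
  rbind(
    data.frame(x = as.character(data$x[!small]), y = data$y[!small],
               stringsAsFactors = FALSE),
    data.frame(x = label, y = sum(data$y[small]), stringsAsFactors = FALSE)
  )
}

df <- data.frame(x = c("A", "B", "C", "D", "E", "F", "G", "H"),
                 y = c(20, 17, 16, 14, 12, 9, 6, 5))
res <- lump_below(df)
res
```

A row with y exactly equal to the threshold is kept as is, which matches the "y < 10" wording of the question.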

Related

Concatenate row values given varying conditions in R

I am trying to concatenate certain row values (Strings) given varying conditions in R. I have flagged the row values in Flag (the flagging criteria are irrelevant in this example).
Notations: B is the beginning of a run and E the end. 0 is outside the run. 1 denotes any strings excluding B and E in the run. Your solution does not need to follow my convention.
Rules: Every run must begin with B and end with E. There can be any number of 1s in the run. All Strings positioned between B and E (both inclusive) are to be concatenated in the order in which they appear in the run, and the result replaces the B-string. 0-strings remain in the dataframe. 1- and E-strings are removed after concatenation.
I haven't come up with anything close to the desired output.
set.seed(128)
df2 <- data.frame(Strings = sample(letters, 17, replace = T),
Flag = c(0,"B",1,1,"E","B","E","B","E",0,"B",1,1,1,"E",0,0))
Strings Flag
1 d 0
2 r B
3 q 1
4 r 1
5 v E
6 f B
7 y E
8 u B
9 c E
10 x 0
11 h B
12 w 1
13 x 1
14 t 1
15 j E
16 d 0
17 j 0
Intermediate output.
Strings Flag Result
1 d 0 d
2 r B r q r v
3 q 1 q
4 r 1 r
5 v E v
6 f B f y
7 y E y
8 u B u c
9 c E c
10 x 0 x
11 h B h w x t j
12 w 1 w
13 x 1 x
14 t 1 t
15 j E j
16 d 0 d
17 j 0 j
Desired output.
Result
1 d
2 r q r v
3 f y
4 u c
5 x
6 h w x t j
7 d
8 j
Here is a solution that might help you. However, I am still not sure if I got your point correctly:
library(dplyr)
df2 %>%
mutate(Flag2 = cumsum(Flag == 'B' | Flag == '0')) %>%
group_by(Flag2) %>%
summarise(Result = paste0(Strings, collapse = ' '))
# A tibble: 8 × 2
Flag2 Result
<int> <chr>
1 1 d
2 2 r q r v
3 3 f y
4 4 u c
5 5 x
6 6 h w x t j
7 7 d
8 8 j
Using dplyr:
library(dplyr)
set.seed(128)
df2 <- data.frame(Strings = sample(letters, 17, replace = T),
Flag = c(0,"B",1,1,"E","B","E","B","E",0,"B",1,1,1,"E",0,0))
df2 %>%
group_by(group = cumsum( (Flag=="B") + (lag(Flag,1,"0")=="E"))) %>%
mutate(Result=if_else(Flag=="B", paste0(Strings,collapse = " "),Strings)) %>%
filter(!(Flag %in% c("1", "E"))) %>% ungroup() %>%
select(-group, -Strings, -Flag)
#> # A tibble: 8 × 1
#> Result
#> <chr>
#> 1 d
#> 2 r q r v
#> 3 f y
#> 4 u c
#> 5 x
#> 6 h w x t j
#> 7 d
#> 8 j
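Both answers rest on the same trick: cumsum() over a logical "a new group starts here" vector gives every run its own id. A minimal base R sketch of that idea follows. The vectors strings and flag are typed in from the table printed in the question rather than regenerated with sample(), and the sketch assumes every "1" and "E" row sits inside a run opened by a "B".

```r
# Values copied from the printed df2 in the question
strings <- c("d", "r", "q", "r", "v", "f", "y", "u", "c", "x",
             "h", "w", "x", "t", "j", "d", "j")
flag <- c("0", "B", "1", "1", "E", "B", "E", "B", "E", "0",
          "B", "1", "1", "1", "E", "0", "0")

# A new group starts at every "B" (start of a run) and every "0" (stand-alone row)
group <- cumsum(flag %in% c("B", "0"))

# Paste the strings of each group together, keeping their original order
result <- as.vector(tapply(strings, group, paste, collapse = " "))
result
```

Because "1" and "E" rows never start a group, they are absorbed into the preceding "B" row, which is what removes them from the output.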

Stacking datasets

I have two datasets that need to be stacked on top of each other. Think of them as two subsets of one dataset. The issue is that they have completely different variables apart from "record_id" and one more variable:
ds_app <- data.frame(record_id, a, b, c, d, e, f)
ds_vo <- data.frame(record_id, g, h, i, j, k, l , m, n, o, p, q)
Is there an easy way to stack these without having to create dummy variables (variables filled with NA values)?
Thanks so much!
I guess you may need merge
merge(df1,df2, all = TRUE)
Example
> df1 <- data.frame(id = 1:3, a = 1:3, b = 4:6)
> df2 <- data.frame(id = 1:5, g = 1:5, h = 6:10, i = 11:15)
> df1
id a b
1 1 1 4
2 2 2 5
3 3 3 6
> df2
id g h i
1 1 1 6 11
2 2 2 7 12
3 3 3 8 13
4 4 4 9 14
5 5 5 10 15
> merge(df1,df2, all = TRUE)
id a b g h i
1 1 1 4 1 6 11
2 2 2 5 2 7 12
3 3 3 6 3 8 13
4 4 NA NA 4 9 14
5 5 NA NA 5 10 15
Another option with full_join
library(dplyr)
full_join(df1, df2)
data
df1 <- data.frame(id = 1:3, a = 1:3, b = 4:6)
df2 <- data.frame(id = 1:5, g = 1:5, h = 6:10, i = 11:15)
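Note that merge and full_join match rows on the shared id column, so rows with the same id end up side by side rather than stacked. If a literal stack is wanted (every row of both inputs kept, missing columns filled with NA), a minimal base R sketch could look like the following. stack_fill is a hypothetical helper name introduced here, and the example reuses df1 and df2 from the answers above.

```r
# Hypothetical helper: stack two data frames row-wise, adding NA columns
# for any variable that exists in only one of them.
stack_fill <- function(a, b) {
  all_cols <- union(names(a), names(b))
  for (nm in setdiff(all_cols, names(a))) a[[nm]] <- NA
  for (nm in setdiff(all_cols, names(b))) b[[nm]] <- NA
  rbind(a[all_cols], b[all_cols])  # same columns, same order
}

df1 <- data.frame(id = 1:3, a = 1:3, b = 4:6)
df2 <- data.frame(id = 1:5, g = 1:5, h = 6:10, i = 11:15)
stacked <- stack_fill(df1, df2)
stacked
```

Here the result has 8 rows (3 + 5), whereas the merge above has 5. With dplyr loaded, bind_rows(df1, df2) should give the same kind of NA-filled stack without a helper.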

filter one dataframe via conditions in another

I want to filter a dataframe d by an arbitrary number of conditions (represented as rows in another dataframe z).
I begin with a dataframe d:
d <- data.frame(x = 1:10, y = letters[1:10])
The second dataframe, z, has columns x1 and x2, which are the lower and upper limits used to filter d$x. z may grow to an arbitrary number of rows.
z <- data.frame(x1 = c(1,3,8), x2 = c(1,4,10))
I want to return all rows of d for which d$x < z$x1[i] or d$x > z$x2[i] for every i in 1:nrow(z), that is, rows whose x falls outside every [x1, x2] interval.
For this toy example, that means excluding everything in 1:1, 3:4 and 8:10, inclusive:
x y
2 2 b
5 5 e
6 6 f
7 7 g
We can create a sequence between each pair of x1 and x2 values and use anti_join to select the rows of d whose x is not in that set (remove).
library(tidyverse)
remove <- z %>%
mutate(x = map2(x1, x2, seq)) %>%
unnest(x) %>%
select(x)
anti_join(d, remove)
# x y
#1 2 b
#2 5 e
#3 6 f
#4 7 g
We can use a non-equi join
library(data.table)
i1 <- setDT(d)[z, .I, on = .(x >=x1, x <= x2), by = .EACHI]$I
i1
#[1] 1 3 4 8 9 10
d[i1]
# x y
#1: 1 a
#2: 3 c
#3: 4 d
#4: 8 h
#5: 9 i
#6: 10 j
d[!i1]
# x y
#1: 2 b
#2: 5 e
#3: 6 f
#4: 7 g
Or using fuzzyjoin
library(fuzzyjoin)
library(dplyr)
fuzzy_inner_join(d, z, by = c('x' = 'x1', 'x' = 'x2'),
match_fun = list(`>=`, `<=`)) %>%
select(names(d))
# A tibble: 6 x 2
# x y
# <int> <fct>
#1 1 a
#2 3 c
#3 4 d
#4 8 h
#5 9 i
#6 10 j
Or to get the rows of 'd' that fall outside every interval in 'z'
fuzzy_anti_join(d, z, by = c('x' = 'x1', 'x' = 'x2'),
match_fun = list(`>=`, `<=`)) %>%
select(names(d))
# A tibble: 4 x 2
# x y
# <int> <fct>
#1 2 b
#2 5 e
#3 6 f
#4 7 g
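For small inputs the same filter can be written without any packages. The sketch below is one possible base R version, not taken from the answers above: for every value of d$x it checks whether the value lies inside at least one [x1, x2] interval of z, then keeps the rows where it does not. The names in_any and kept are introduced here for illustration.

```r
d <- data.frame(x = 1:10, y = letters[1:10])
z <- data.frame(x1 = c(1, 3, 8), x2 = c(1, 4, 10))

# TRUE when a value of x lies inside at least one closed interval [x1, x2]
in_any <- vapply(d$x, function(v) any(v >= z$x1 & v <= z$x2), logical(1))

kept <- d[!in_any, ]  # rows outside every interval
kept
```

This compares every row of d with every row of z, so the join-based answers are likely the better choice when both tables are large.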

convert matrix into dataframe in r

I am trying to convert a matrix to a dataframe, turning the matrix's row names and column names into variables in the dataframe.
Here is the sample:
sample = matrix(c(1,NA,NA,2,NA,3,NA,NA,5,NA,NA,6,NA,NA,NA,NA,8,NA,3,1),ncol = 4)
colnames(sample) = letters[1:4]
row.names(sample) = letters[22:26]
My dataset has a lot of NA so I am trying to remove all the NA in the dataframe.
So here is my desired output:
data.frame(col = c("v","v","w","w","y","y","y","z"),
row = c("a","b","c","d","a","b","d","d"),
value = c(1,3,6,8,2,5,3,1))
Use melt from the reshape2 package for reshaping, then drop the NA rows. Finally, do some formatting to get your desired output (ordering, setting column names...).
> library(reshape2)
> df <- na.omit(melt(sample)) # reshaping
> df <- df[order(df$Var1), ] # ordering
> colnames(df) <- c("col", "row", "value") # setting colnames
> df # getting desired output
col row value
1 v a 1
6 v b 3
12 w c 6
17 w d 8
4 y a 2
9 y b 5
19 y d 3
20 z d 1
With dplyr and magrittr (melt still comes from reshape2)
> library(reshape2)
> library(magrittr)
> library(dplyr)
> sample %>% melt %>%
na.omit %>%
arrange(., Var1) %>%
setNames(c('col', 'row', 'value'))
col row value
1 v a 1
2 v b 3
3 w c 6
4 w d 8
5 y a 2
6 y b 5
7 y d 3
8 z d 1
Here is a base R method by replicating the row names and column names
out <- na.omit(data.frame(col = rownames(sample)[row(sample)],
row = colnames(sample)[col(sample)], value = c(sample)))
out <- out[order(out$col),]
row.names(out) <- NULL
out
# col row value
#1 v a 1
#2 v b 3
#3 w c 6
#4 w d 8
#5 y a 2
#6 y b 5
#7 y d 3
#8 z d 1
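When most cells are NA, another base R option is to locate only the non-missing cells with which(..., arr.ind = TRUE) and build the data frame from those positions. This is a sketch under the question's naming (col holds the matrix row names, row holds the matrix column names); the matrix is called m here, an assumed name chosen to avoid masking base::sample.

```r
m <- matrix(c(1, NA, NA, 2, NA, 3, NA, NA, 5, NA,
              NA, 6, NA, NA, NA, NA, 8, NA, 3, 1), ncol = 4)
colnames(m) <- letters[1:4]
rownames(m) <- letters[22:26]

# Two-column matrix of (row, col) positions of the non-NA cells
idx <- which(!is.na(m), arr.ind = TRUE)

out <- data.frame(col = rownames(m)[idx[, "row"]],
                  row = colnames(m)[idx[, "col"]],
                  value = m[idx],  # matrix indexing picks one cell per position
                  stringsAsFactors = FALSE)
out <- out[order(out$col, out$row), ]
rownames(out) <- NULL
out
```

Since the NA cells are never materialised as rows, no na.omit step is needed afterwards.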

extracting variables in R using frequencies

Suppose I have a dataframe:
x y
a 1
b 2
a 3
a 4
b 5
c 6
a 7
d 8
a 9
b 10
e 12
b 13
c 15
I want to create another dataframe that includes only the x values that occur at least 3 times (a and b, in this case), and their highest corresponding y values.
So I want the output as:
x y
a 9
b 13
Here 9 and 13 are the highest values of a and b respectively
I tried using:
sort-(table(x,y))
but it did not work.
The data.table package is great for this. If df is the original data, you can do
library(data.table)
setDT(df)[, .(y = max(y)[.N >= 3]), by=x]
# x y
# 1: a 9
# 2: b 13
.N is an integer that tells us how many rows are in each group (which we've set to x here). So we just subset max(y) such that .N is at least three.
Here's one way, using subset to omit any x that occurs fewer than 3 times, and then aggregate to find the maximum value by group:
d <- read.table(text='x y
a 1
b 2
a 3
a 4
b 5
c 6
a 7
d 8
a 9
b 10
e 12
b 13
c 15', header=TRUE)
with(subset(d, x %in% names(which(table(d$x) >= 3))),
aggregate(list(y=y), list(x=x), max))
# x y
# 1 a 9
# 2 b 13
And for good measure, a dplyr approach:
library(dplyr)
d %>%
group_by(x) %>%
filter(n() >= 3) %>%
summarise(max(y))
# Source: local data frame [2 x 2]
#
# x max(y)
# 1 a 9
# 2 b 13
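The two steps (count occurrences, then take the per-group maximum) can also be spelled out separately with table() and tapply(). This is only an illustrative sketch; counts, keep, maxima and res are names introduced here, and d is rebuilt inline from the data in the question so the block runs on its own.

```r
d <- data.frame(x = c("a", "b", "a", "a", "b", "c", "a", "d", "a", "b", "e", "b", "c"),
                y = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 15))

counts <- table(d$x)                            # occurrences of each x
keep <- names(counts)[as.vector(counts) >= 3]   # x values seen at least 3 times
maxima <- tapply(d$y, d$x, max)                 # largest y for every x

res <- data.frame(x = keep, y = as.vector(maxima[keep]))
res
```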
