Grouping into desired number of groups - r

I have a data frame like this:
ID is the primary key and Apples is the number of apples that person has.
ID   Apples
E1   10
E2   5
E3   NA
E4   5
E5   8
E6   12
E7   NA
E8   4
E9   NA
E10  8
I want to group the NA and non-NA values into only 2 separate groups and get the count of each. I tried the normal group_by(), but it does not give me the desired output.
Fruits %>% group_by(Apples) %>% summarize(n())
Apples n()
<dbl> <int>
4 1
5 2
8 2
10 1
12 1
NA 3
My desired output:
Apples n()
<dbl> <int>
non-NA 7
NA 3

We can create a group for NA and non-NA using group_by, and we can also make it a factor so that we can change the labels in the same step. Then, get the number of observations for each group.
library(dplyr)
df %>%
  group_by(grp = factor(is.na(Apples), labels = c("non-NA", "NA"))) %>%
  summarise(`n()` = n())
# grp `n()`
# <fct> <int>
#1 non-NA 7
#2 NA 3
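As a compact variant of the same idea, count() (from dplyr, already loaded above) wraps group_by() + summarise() and returns the tally in a column named n:
# count() lets you create the grouping variable on the fly
df %>%
  count(grp = factor(is.na(Apples), labels = c("non-NA", "NA")))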
Or in base R, we could use colSums:
data.frame(Apples = c("non-NA", "NA"), n = c(colSums(!is.na(df))[2], colSums(is.na(df))[2]), row.names = NULL)
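If you would rather not rely on Apples being the second column, a minimal base R sketch sums the logical vector directly:
# sum() on a logical vector counts the TRUE values
data.frame(Apples = c("non-NA", "NA"),
           n = c(sum(!is.na(df$Apples)), sum(is.na(df$Apples))))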
Data
df <- structure(list(ID = c("E1", "E2", "E3", "E4", "E5", "E6", "E7",
"E8", "E9", "E10"), Apples = c(10L, 5L, NA, 5L, 8L, 12L, NA,
4L, NA, 8L)), class = "data.frame", row.names = c(NA, -10L))

In base R, this can also be done with table on a logical vector:
table(!is.na(df$Apples))
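The counts come back named FALSE/TRUE rather than NA/non-NA; one small optional sketch relabels them to match the desired output:
# FALSE marks the NA values, TRUE the non-NA values, so rename in that order
tab <- table(!is.na(df$Apples))
setNames(as.integer(tab), c("NA", "non-NA"))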

Related

Show duplicate value on a separate row in pivot wider

I have seen tons of answers but could not get it right. Basically I want to show duplicates on separate rows while performing pivot_wider(). I created a unique variable as well, but the result was either a nested row or a separate row for each column.
df <- structure(list(identifier = c("e1", "e1", "e2", "e2", "e1", "e1",
"e1", "e1", "e2", "e2"), label = c("Monaco", "became", "the",
"first", "the", "the", "Monaco", "became", "the", "first"), id = c("CP1",
"CP1", "CP1", "CP1", "CP1", "CP1", "CP2", "CP2", "CP2", "CP2"
), value = c(0L, 0L, 1L, 0L, 10L, 1L, 1L, 0L, 1L, 0L)), class = "data.frame", row.names = c(NA,
-10L))
library(tidyverse)
#My try
df %>%
  group_by(identifier, label) %>%
  mutate(rn = row_number()) %>%
  pivot_wider(names_from = "id",
              values_from = "value")
library(data.table)
library(tidyr)
unnest(dcast(setDT(df), identifier + label ~ id, value.var = "value",
             fill = NA, fun.aggregate = list), cols = c("CP1", "CP2"))
# # A tibble: 6 x 4
# identifier label CP1 CP2
# <chr> <chr> <int> <int>
# 1 e1 Monaco 0 1
# 2 e1 became 0 0
# 3 e1 the 10 NA
# 4 e1 the 1 NA
# 5 e2 first 0 0
# 6 e2 the 1 1
You can use -
library(dplyr)
library(tidyr)
df %>%
  pivot_wider(names_from = id, values_from = value, values_fn = list) %>%
  unnest(cols = c(CP1, CP2))
# identifier label CP1 CP2
# <chr> <chr> <int> <int>
#1 e1 Monaco 0 1
#2 e1 became 0 0
#3 e2 the 1 1
#4 e2 first 0 0
#5 e1 the 10 NA
#6 e1 the 1 NA
You were close with your attempt as well; you just had to include id in group_by -
df %>%
  group_by(identifier, label, id) %>%
  mutate(rn = row_number()) %>%
  pivot_wider(names_from = id, values_from = value)
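Note that the helper rn column stays in the result after pivoting; if you do not want it, an optional final step is to ungroup and drop it:
df %>%
  group_by(identifier, label, id) %>%
  mutate(rn = row_number()) %>%
  pivot_wider(names_from = id, values_from = value) %>%
  ungroup() %>%
  select(-rn)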

Aggregating rows across multiple values

I have a large dataframe with approximately this pattern:
Person  Rate  Street  a     b     c     d     e     f
A       2     XYZ     1     NULL  3     4     5     NULL
A       2     XYZ     NULL  2     NULL  NULL  NULL  NULL
A       3     XYZ     NULL  NULL  NULL  NULL  NULL  6
B       2     DEF     NULL  NULL  NULL  NULL  5     NULL
B       2     DEF     NULL  2     3     NULL  NULL  6
C       1     DEF     1     2     3     4     5     6
Columns a, b, c, d, e, f stand in for about 600 columns.
I am trying to combine the columns so that each person becomes one line, rows a-f combine into a single line using sum, and any conflicting rate or street information becomes a new row. So the data should look something like this:
Person  Rate  Rate 2  Street  a     b  c  d     e  f
A       2     3       XYZ     1     2  3  4     5  6
B       2             DEF     NULL  2  3  NULL  5  6
C       1             DEF     1     2  3  4     5  6
I keep trying to make this work with aggregate and summarize but I'm not sure that's the right approach.
Thank you very much for your help!
First we pivot all the unique rates per person and street.
library(reshape2)
tmp1=dcast(unique(df[,c("Person","Rate","Street")]),Person+Street~Rate,value.var="Rate")
colnames(tmp1)[-c(1:2)]=paste("Rate",colnames(tmp1)[-c(1:2)])
Then we aggregate and sum by person and street, over columns 4 to 9 (from a to f); change these accordingly.
tmp2 = aggregate(df[, 4:9], list(Person = df$Person, Street = df$Street), function(x){
  ifelse(all(is.na(x)), NA, sum(x, na.rm = TRUE))
})
And finally merge the two.
merge(tmp1,tmp2,by=c("Person","Street"))
Person Street Rate 1 Rate 2 Rate 3 a b c d e f
1 A XYZ NA 2 3 1 2 3 4 5 6
2 B DEF NA 2 NA NA 2 3 NA 5 6
3 C DEF 1 NA NA 1 2 3 4 5 6
Perhaps you can do this in a two-step process -
library(dplyr)
library(tidyr)
#sum columns a-f
table1 <- df %>%
  group_by(Person) %>%
  summarise(across(a:f, sum, na.rm = TRUE))
#Remove duplicated values and get the data in separate columns
#for Rate and Street columns.
table2 <- df %>%
  group_by(Person) %>%
  mutate(across(c(Rate, Street), ~replace(., duplicated(.), NA))) %>%
  select(Person, Rate, Street) %>%
  filter(if_any(c(Rate, Street), ~!is.na(.))) %>%
  mutate(col = row_number()) %>%
  ungroup() %>%
  pivot_wider(names_from = col, values_from = c(Rate, Street)) %>%
  select(where(~any(!is.na(.))))
#Join the two data to get final result
inner_join(table1, table2, by = 'Person')
# Person a b c d e f Rate_1 Rate_2 Street_1
# <chr> <int> <int> <int> <int> <int> <int> <int> <int> <chr>
#1 A 1 2 3 4 5 6 2 3 XYZ
#2 B 0 2 3 0 5 6 2 NA DEF
#3 C 1 2 3 4 5 6 1 NA DEF
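Note that recent dplyr versions deprecate passing extra arguments such as na.rm = TRUE through across(); an equivalent form of the first step uses a lambda instead:
#same summary of columns a-f, written with an anonymous function
table1 <- df %>%
  group_by(Person) %>%
  summarise(across(a:f, ~ sum(.x, na.rm = TRUE)))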
data
It is easier to help when you share data in a reproducible format that can be copied directly. I have used the data below for the answer.
df <- structure(list(Person = c("A", "A", "A", "B", "B", "C"), Rate = c(2L,
2L, 3L, 2L, 2L, 1L), Street = c("XYZ", "XYZ", "XYZ", "DEF", "DEF",
"DEF"), a = c(1L, NA, NA, NA, NA, 1L), b = c(NA, 2L, NA, NA,
2L, 2L), c = c(3L, NA, NA, NA, 3L, 3L), d = c(4L, NA, NA, NA,
NA, 4L), e = c(5L, NA, NA, 5L, NA, 5L), f = c(NA, NA, 6L, NA,
6L, 6L)), row.names = c(NA, -6L), class = "data.frame")

How can I use data in df.x and use functions to select and add to df.y

I am new to R and I am trying to write a piece of code that will enable me to pick some data in df.x and put it in df.y:
Category 2019 2020 2021 2022 2023
Apple 3 4 5 6 7
Pear 3 4 5 6 7
Banana 3 4 5 6 7
Oranges 3 4 5 6 7
I want to put the values for Oranges into a new df.y, along with the year-to-year differences for Apple, like this:
Resource 2019 2020 2021 2022 2023
Orange 3 4 5 6 7
Apple 1 1 1 1
Any help is appreciated!!!
Thanks
This is a tidyverse approach involving wide to long transform since it's easier to calculate the year differences.
library(tidyverse)
df.x <- tibble(
  Category = c("Apple", "Pear", "Banana", "Orange"),
  "2019" = 3,
  "2020" = 4,
  "2021" = 5,
  "2022" = 6,
  "2023" = 7
)
df.y <- df.x %>%
  filter(Category %in% c("Apple", "Orange")) %>%
  pivot_longer(-Category) %>%
  mutate(value = ifelse(Category == "Apple", value - lag(value, 1), value)) %>%
  pivot_wider()
# A tibble: 2 x 6
# Category `2019` `2020` `2021` `2022` `2023`
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 Apple NA 1 1 1 1
#2 Orange 3 4 5 6 7
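The lag() here relies on the row order produced by pivot_longer(); a slightly more defensive sketch of the same idea groups by Category so each fruit's differences are computed only within its own rows:
df.y <- df.x %>%
  filter(Category %in% c("Apple", "Orange")) %>%
  pivot_longer(-Category) %>%
  group_by(Category) %>%
  mutate(value = ifelse(Category == "Apple", value - lag(value), value)) %>%
  ungroup() %>%
  pivot_wider()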
First you need to provide your data using dput(df.x):
df.x <- structure(list(Category = c("Apple", "Pear", "Banana", "Oranges"
), X2019 = c(3L, 3L, 3L, 3L), X2020 = c(4L, 4L, 4L, 4L), X2021 = c(5L,
5L, 5L, 5L), X2022 = c(6L, 6L, 6L, 6L), X2023 = c(7L, 7L, 7L,
7L)), class = "data.frame", row.names = c(NA, -4L))
Note that your column names have changed because, by default, R does not allow column/variable names to begin with a number. The process for extracting information from a data frame is covered in detail on the manual page: ?Extract. It is a bit dense, so it may be easier to begin with some introductory tutorials on R.
To extract the row for Oranges:
row1 <- df.x[df.x$Category=="Oranges", ]
row1
# Category X2019 X2020 X2021 X2022 X2023
# 4 Oranges 3 4 5 6 7
The row number indicates that this is the 4th row in df.x. Now the second row is slightly more involved. First extract the row:
row2 <- df.x[df.x$Category=="Apple", ]
row2
# Category X2019 X2020 X2021 X2022 X2023
# 1 Apple 3 4 5 6 7
Now compute the differences across the row: convert the data frame row to a simple vector with unlist, since diff is picky about what kind of data structure it will work with, and insert a missing value, NA, for the first year:
row2[ , -1] <- c(X2019=NA, diff(unlist(row2[, -1])))
df.y <- rbind(row1, row2)
df.y
# Category X2019 X2020 X2021 X2022 X2023
# 4 Oranges 3 4 5 6 7
# 1 Apple NA 1 1 1 1
rownames(df.y) <- NULL
The last line just resets the row names which have carried over from df.x.

Transposing a dataframe and using the first column as an index

I have a dataframe that is structured like below, where A/B/C/D are different treatment methods:
input <- read.table(text="
filename wavelength A B C D
file1 w1 NA NA 1 2
file1 w2 NA NA 3 2
file1 w3 NA NA 6 2
file2 w1 3 4 NA NA
file2 w2 4 8 NA NA
file2 w3 6 1 NA NA", header=TRUE)
And I would like for it to be transposed so that wavelength is the header and treatments are rows with the filenames duplicated each time:
desired <- read.table(text="
filename Method w1 w2 w3
file1 C 1 3 6
file1 D 2 2 2
file2 A 3 4 6
file2 B 4 8 1", header=TRUE)
I've tried melt/cast from reshape2, melt from the data.table package, gather/spread, t - everything I can think of. The actual data frame in the end will be about 500 rows by 3500 columns - so I would prefer not to call out any specific column or method names. My issue seems mainly to be that I can't call all method columns under one value and use it to melt:
colMethods <- myData[, 2:length(myData)]
A lot of times I don't get an error, but the dataframe R returns is just a list of wavelengths and a column that says 'wavelength'. How would any of you approach this? Thanks!
You can try this:
library(tidyverse)
#Data
df <- structure(list(filename = c("file1", "file1", "file1", "file2",
"file2", "file2"), wavelength = c("w1", "w2", "w3", "w1", "w2",
"w3"), A = c(NA, NA, NA, 3L, 4L, 6L), B = c(NA, NA, NA, 4L, 8L,
1L), C = c(1L, 3L, 6L, NA, NA, NA), D = c(2L, 2L, 2L, NA, NA,
NA)), class = "data.frame", row.names = c(NA, -6L))
Code:
df %>%
  pivot_longer(cols = -c(1, 2)) %>%
  filter(!is.na(value)) %>%
  pivot_wider(names_from = wavelength, values_from = value)
Output:
# A tibble: 4 x 5
filename name w1 w2 w3
<chr> <chr> <int> <int> <int>
1 file1 C 1 3 6
2 file1 D 2 2 2
3 file2 A 3 4 6
4 file2 B 4 8 1
Here is a data.table alternative using melt and dcast:
library(data.table)
dcast(melt(setDT(input), id.vars = 1:2, na.rm = TRUE),
      variable + filename ~ wavelength, value.var = 'value')
# variable filename w1 w2 w3
#1: A file2 3 4 6
#2: B file2 4 8 1
#3: C file1 1 3 6
#4: D file1 2 2 2
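If you want the result to match the desired output exactly (filename first and a Method column), one optional follow-up is to rename and reorder with data.table's setnames() and setcolorder():
out <- dcast(melt(setDT(input), id.vars = 1:2, na.rm = TRUE),
             variable + filename ~ wavelength, value.var = 'value')
setnames(out, "variable", "Method")
# full column order spelled out for this example's wavelengths
setcolorder(out, c("filename", "Method", "w1", "w2", "w3"))
out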

R function to paste information from different rows with a common column? [duplicate]

This question already has an answer here:
dplyr::first() to choose first non NA value
(1 answer)
Closed 2 years ago.
I understand we can use the dplyr function coalesce() to unite different columns, but is there such a function to unite rows?
I am struggling with a confusing incomplete/doubled dataframe with duplicate rows for the same id, but with different columns filled. E.g.
id sex age source
12 M NA 1
12 NA 3 1
13 NA 2 2
13 NA NA NA
13 F 2 NA
and I am trying to achieve:
id sex age source
12 M 3 1
13 F 2 2
You can try:
library(dplyr)
library(tidyr) # fill() comes from tidyr
#Data
df <- structure(list(id = c(12L, 12L, 13L, 13L, 13L), sex = structure(c(2L,
NA, NA, NA, 1L), .Label = c("F", "M"), class = "factor"), age = c(NA,
3L, 2L, NA, 2L), source = c(1L, 1L, 2L, NA, NA)), class = "data.frame", row.names = c(NA,
-5L))
df %>%
  group_by(id) %>%
  fill(everything(), .direction = "down") %>%
  fill(everything(), .direction = "up") %>%
  slice(1)
# A tibble: 2 x 4
# Groups: id [2]
id sex age source
<int> <fct> <int> <int>
1 12 M 3 1
2 13 F 2 2
As mentioned by @A5C1D2H2I1M1N2O1R2T1, you can select the first non-NA value in each group. This can be done using dplyr:
library(dplyr)
df %>% group_by(id) %>% summarise(across(.fns = ~na.omit(.)[1]))
# A tibble: 2 x 4
# id sex age source
# <int> <fct> <int> <int>
#1 12 M 3 1
#2 13 F 2 2
Base R :
aggregate(.~id, df, function(x) na.omit(x)[1], na.action = 'na.pass')
Or data.table :
library(data.table)
setDT(df)[, lapply(.SD, function(x) na.omit(x)[1]), id]
