Finding only unique values in each column of a data frame in R

I have the below data frame df1. (Edited to have different numbers of repeated value in the data frame.)
> dput(df1)
structure(list(...1 = c("a", "b", "c", "d", "e"), x = c(5, 10,
20, 20, 25), y = c(2, 6, 6, 6, 10), z = c(6, 2, 1, 8, 1)), row.names = c(NA,
-5L), class = c("tbl_df", "tbl", "data.frame"))
>df1
x y z
a 5 2 6
b 10 6 2
c 20 6 1
d 20 6 8
e 25 10 1
I would like to get a df2 which only has the unique values from each column 'x','y' and 'z'.
I tried:
df2<-apply(df1,2, unique)
df2 <- do.call(cbind, df2)
df2 <- as.data.frame(df2)
Desired output:
>df2
 x  y z
 5  2 6
10  6 2
20 10 1
25    8

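Tibbles don't support row names, so the row names end up as an extra first column (...1) in your data. You can delete that column and then apply unique to every remaining column. This works as long as all columns have the same number of unique values: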
library(dplyr)
df1$...1 <- NULL
df1 %>% summarise(across(.fns = unique))
# x y z
# <dbl> <dbl> <dbl>
#1 5 2 6
#2 10 6 2
#3 20 8 1
#4 25 10 8
Or in base R:
df2 <- data.frame(sapply(df1, unique))
If the columns have different numbers of unique values, you can pad the shorter ones with NA:
tmp <- lapply(df1, unique)
data.frame(sapply(tmp, `[`, 1:max(lengths(tmp))))
# x y z
#1 5 2 6
#2 10 6 2
#3 20 10 1
#4 25 NA 8
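The padding step can also be written without sapply, which builds a matrix and would therefore coerce columns of different types to a common type. Here is a minimal base R sketch, assuming a hypothetical helper name unique_columns (not part of the original answer) and a plain data frame without the ...1 column:

```r
# Hypothetical helper: unique values per column, padded with NA.
unique_columns <- function(df) {
  u <- lapply(df, unique)           # one vector of unique values per column
  n <- max(lengths(u))              # length of the longest vector
  padded <- lapply(u, function(v) {
    length(v) <- n                  # extending a vector pads it with NA
    v
  })
  as.data.frame(padded)
}

df1 <- data.frame(x = c(5, 10, 20, 20, 25),
                  y = c(2, 6, 6, 6, 10),
                  z = c(6, 2, 1, 8, 1))
df2 <- unique_columns(df1)
```

Because each column is padded on its own, every column keeps its original type.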

Related

How can we use complete for alphabetic letters?

I have this dataframe:
df <- structure(list(x = c(1, 5, 6, 7, 8), y = c("a", "e", "f", "g",
"h")), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-5L))
x y
<dbl> <chr>
1 1 a
2 5 e
3 6 f
4 7 g
5 8 h
With complete from the tidyr package, I can do:
df %>%
complete(x = full_seq(min(x):max(x), 1))
x y
<dbl> <chr>
1 1 a
2 2 NA
3 3 NA
4 4 NA
5 5 e
6 6 f
7 7 g
8 8 h
Now I would like to do the same with the y column:
df %>%
complete(y = full_seq(min(y):max(y), 1))
This obviously will not work.
How can I use complete from the tidyr package to fill in letters in alphabetical order?
I don't think that's possible in general: the sequence is only well defined for single letters, so strings of more than one letter cannot be completed this way. You can still use the built-in letters vector:
df %>%
complete(y = letters[full_seq(min(x):max(x), 1)])
Or, to rely entirely on y:
df %>%
complete(y = letters[which(letters == min(y)):which(letters == max(y))])
y x
1 a 1
2 b NA
3 c NA
4 d NA
5 e 5
6 f 6
7 g 7
8 h 8
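The index lookup in the last snippet can be seen in isolation with a base R sketch. It assumes, as above, that y contains only single lowercase letters; the names rng, full and out are illustrative and not from the original answer.

```r
df <- data.frame(x = c(1, 5, 6, 7, 8),
                 y = c("a", "e", "f", "g", "h"))

# Positions of the smallest and largest letter in the built-in letters vector
rng <- match(range(df$y), letters)

# Every letter between those two positions
full <- data.frame(y = letters[seq(rng[1], rng[2])])

# Left join: letters that are missing from df get NA in x
out <- merge(full, df, by = "y", all.x = TRUE)
```

The merge with all.x = TRUE plays the role that complete plays in the tidyr version.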

R dataframe: how to "populate" missing data in df1 using df2

I am trying to populate the missing values of df1 with df2.
Whenever there is a valid value for the same cell in both df, I need to keep the value as in df1.
If there is a column in df2 that is not present in df1, this new column (z) has to be added to df1.
This would be a simple example:
id <- c (1, 2, 3, 4, 5)
x <- c (10, NA, 20, 50, 70)
y <- c (3, 5, NA, 6, 9)
df1 <- data.frame(id, x, y)
id <- c ( 2, 3, 5)
x <- c (10, NA, NA)
z <- c (NA, 6, 7)
df2 <- data.frame(id, x, z)
I would like to obtain "df3":
id x y z
1 1 10 3 NA
2 2 10 5 NA
3 3 20 6 6
4 4 50 6 NA
5 5 70 9 7
I tried several "merge" options that didn't work.
One option is to fill the missing values with a few extract-and-replace steps and then merge in the new column:
idx <- is.na(df1[df2$id,])
df1[df2$id,][idx] <- df2[idx]
out <- merge(df1, df2[, c("id", "z")], by = "id", all.x = TRUE)
Result
out
# id x y z
#1 1 10 3 NA
#2 2 10 5 NA
#3 3 20 6 6
#4 4 50 6 NA
#5 5 70 9 7
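Note that the indexing steps above match cells by position: df1[df2$id, ] relies on id being equal to the row number, and df2[idx] takes values from the column in the same position, so the NA in y for id 3 is filled from z. Where matching by id and by column name is wanted instead, a base R sketch could look like this (the suffix .df2 and the name out2 are illustrative assumptions). In this variant y for id 3 stays NA, because df2 has no y column:

```r
id <- c(1, 2, 3, 4, 5)
x  <- c(10, NA, 20, 50, 70)
y  <- c(3, 5, NA, 6, 9)
df1 <- data.frame(id, x, y)

id <- c(2, 3, 5)
x  <- c(10, NA, NA)
z  <- c(NA, 6, 7)
df2 <- data.frame(id, x, z)

# Left join on id; columns present in both get the suffix ".df2" on the df2 side
out2 <- merge(df1, df2, by = "id", all.x = TRUE, suffixes = c("", ".df2"))

# Keep df1's value where it exists, otherwise fall back to df2's value
out2$x <- ifelse(is.na(out2$x), out2$x.df2, out2$x)
out2$x.df2 <- NULL
```

With more shared columns, the same ifelse step would be repeated for each column name that df1 and df2 have in common.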

How to create a new dataframe of consolidated values from multiple columns in R

I have a dataframe, df1, that looks like the following:
sample 99_Ape_1 93_Cat_1 87_Ape_2 84_Cat_2 90_Dog_1 92_Dog_2
A      2        3        1        7        4        6
B      5        9        7        0        3        7
C      6        8        9        2        3        0
D      3        9        0        5        8        3
I want to consolidate the dataframe by summing the values based on animal present in the header row, i.e. by "Ape", "Cat", "Dog", and end up with the following dataframe:
sample Ape Cat Dog
A      3   10  10
B      12  9   10
C      15  10  3
D      3   14  11
I have created a list of all the animals, called "animals_list".
I have then created a list of dataframes that subsets each animal into a separate dataframe with:
animals_extract <- c()
for (i in 1:length(animals_list)){
species_extract[[i]] <- df1[, grep(animals_list[i], names(df1))]
}
I am then trying to sum each variable in the row by sample:
for (i in 1:length(species_extract)){
species_extract[[i]]$total <- rowSums(species_extract[[i]])
}
and then create a dataframe 'animal_total' by binding all values in the new 'total' column.
animal_total <- NULL
for (i in 1:length(species_extract)){
animal_total[i] <- cbind(species_extract[[i]]$total)
}
Unfortunately, this doesn't seem to work at all and I think I may have taken the wrong route. Any help would be really appreciated!
EDIT: my dataframe has over 300 animals, meaning incorporating use of my list of identifiers (animals_list) would be highly appreciated! I would also note that some column names do not follow the structure, "number_animal_number" and therefore I can't use a repetitive search (sorry!).
A data.table approach:
library(data.table)
library(rlist)
#set data to data.table format
setDT(df1)
# split column 2:n by regex on column names
L <- split.default(df1[,-1], gsub(".*_(.*)_.*", "\\1", names(df1)[-1]))
# Bind together again
data.table(sample = df1$sample,
as.data.table(list.cbind(lapply(L, rowSums))))
# sample Ape Cat Dog
# 1: A 3 10 10
# 2: B 12 9 10
# 3: C 15 10 3
# 4: D 3 14 11
Update, after clarification:
This may work, depending on the other names of your animals, but it is a start:
library(dplyr)
library(tidyr)
library(stringr) # provides str_extract()
df %>%
pivot_longer(
cols = -sample
) %>%
mutate(name1 = str_extract(name, '(?<=\\_)(.*?)(?=\\_)')) %>%
group_by(sample, name1) %>%
summarise(sum=sum(value)) %>%
pivot_wider(
names_from = name1,
values_from= sum
)
Output:
sample Ape Cat Dog
<chr> <int> <int> <int>
1 A 3 10 10
2 B 12 9 10
3 C 15 10 3
4 D 3 14 11
First answer:
Here is how we could do it with dplyr:
library(dplyr)
df %>%
mutate(Cat = rowSums(select(., contains("Cat"))),
Ape = rowSums(select(., contains("Ape"))),
Dog = rowSums(select(., contains("Dog")))) %>%
select(sample, Ape, Cat, Dog)
sample Ape Cat Dog
<chr> <int> <int> <int>
1 A 3 10 10
2 B 12 9 10
3 C 15 10 3
4 D 3 14 11
An alternative data.table solution
library(data.table)
# Construct data table
dt <- as.data.table(list(sample = c("A", "B", "C", "D"),
`99_Ape_1` = c(2, 5, 6, 3),
`93_Cat_1` = c(3, 9, 8, 9),
`87_Ape_2` = c(1, 7, 9, 0),
`84_Cat_2` = c(7, 0, 2, 5),
`90_Dog_1` = c(4, 3, 3, 8),
`92_Dog_2` = c(6, 7, 0, 3)))
# Alternatively convert existing dataframe
# dt <- setDT(df)
# Use Regex pattern to drop ids from column names
names(dt) <- gsub("((^[0-9_]{3})|(_[0-9]{1}$))", "", names(dt))
# Pivot long (columns to rows)
dt <- melt(dt, id.vars = "sample")
# Aggregate sample by variable
dt <- dt[, .(value=sum(value)), by=.(sample, variable)]
# Pivot wide (rows to columns)
dcast(dt, sample ~ variable)
# sample Ape Cat Dog
# 1: A 3 10 10
# 2: B 12 9 10
# 3: C 15 10 3
# 4: D 3 14 11
Alternatively, leaving the column names as is (after comment from OP to previous answer) and assuming that there are multiple observations of the same samples:
dt <- as.data.table(list(sample = c("A", "B", "C", "D", "A"),
`99_Ape_1` = c(2, 5, 6, 3, 1),
`93_Cat_1` = c(3, 9, 8, 9, 1),
`87_Ape_2` = c(1, 7, 9, 0, 1),
`84_Cat_2` = c(7, 0, 2, 5, 1),
`90_Dog_1` = c(4, 3, 3, 8, 1),
`92_Dog_2` = c(6, 7, 0, 3, 1)))
dt
# sample 99_Ape_1 93_Cat_1 87_Ape_2 84_Cat_2 90_Dog_1 92_Dog_2
# 1: A 2 3 1 7 4 6
# 2: B 5 9 7 0 3 7
# 3: C 6 8 9 2 3 0
# 4: D 3 9 0 5 8 3
# 5: A 1 1 1 1 1 1
# Pivot long (columns to rows)
dt <- melt(dt, id.vars = "sample")
# Aggregate sample by variable
dt <- dt[, .(value=sum(value)), by=.(sample, variable)]
# Pivot wide (rows to columns)
dcast(dt, sample ~ variable)
# sample 99_Ape_1 93_Cat_1 87_Ape_2 84_Cat_2 90_Dog_1 92_Dog_2
# 1: A 3 4 2 8 5 7
# 2: B 5 9 7 0 3 7
# 3: C 6 8 9 2 3 0
# 4: D 3 9 0 5 8 3
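Since the question asks for a solution driven by animals_list, here is a base R sketch along those lines. It assumes animals_list is a character vector and that each animal name appears as a literal substring only in its own columns (a name such as "Ape" would also match a hypothetical column containing "Grape"). The name animal_total follows the question; totals is illustrative.

```r
df1 <- data.frame(sample = c("A", "B", "C", "D"),
                  `99_Ape_1` = c(2, 5, 6, 3),
                  `93_Cat_1` = c(3, 9, 8, 9),
                  `87_Ape_2` = c(1, 7, 9, 0),
                  `84_Cat_2` = c(7, 0, 2, 5),
                  `90_Dog_1` = c(4, 3, 3, 8),
                  `92_Dog_2` = c(6, 7, 0, 3),
                  check.names = FALSE)

animals_list <- c("Ape", "Cat", "Dog")

# For each animal, sum across all columns whose name contains that animal
totals <- sapply(animals_list, function(animal) {
  cols <- grep(animal, names(df1), fixed = TRUE)
  rowSums(df1[, cols, drop = FALSE])
})

# One row per sample, one column per animal
animal_total <- data.frame(sample = df1$sample, totals)
```

Because the matching uses fixed = TRUE, column names do not need to follow the number_animal_number pattern; they only need to contain the animal name.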

R: Merge two data frames based on value in column and return all values of both data frames

Let's say I have the following dfs
df1:
a b c d
1 2 3 4
4 3 3 4
9 7 3 4
df2:
a b c d
1 2 3 4
2 2 3 4
3 2 3 4
Now I want to merge both dfs conditional of column "a" to give me the following df
a b c d
1 2 3 4
4 3 3 4
9 7 3 4
2 2 3 4
3 2 3 4
On my dataset I tried using
merge <- merge(x = df1, y = df2, by = "a", all = TRUE)
However, df1 has 50,000 entries and df2 has 100,000 entries, and although there are definitely matching values in column a, the merged df has over one million entries. I do not understand this. As I understand it, there should be at most 150,000 entries in the merged df, and that maximum should occur only when no values in column a are shared between the two dfs.
I think what you want to do is not merge but rather rbind the two dataframes and remove the duplicated rows:
DATA:
df1 <- data.frame(a = c(1,4,9),
b = c(2,3,7),
c = c(3,3,3),
d = c(4,4,4))
df2 <- data.frame(a = c(1,2,3),
b = c(2,2,2),
c = c(3,3,3),
d = c(4,4,4))
SOLUTION:
Row-bind df1 and df2:
df3 <- rbind(df1, df2)
Remove the duplicate rows:
df3 <- df3[!duplicated(df3), ]
RESULT:
df3
a b c d
1 1 2 3 4
2 4 3 3 4
3 9 7 3 4
5 2 2 3 4
6 3 2 3 4
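As for why the merge in the question produced more than 150,000 rows: a likely cause (an assumption, since the real data is not shown) is that values in column a are repeated within each data frame. merge returns every combination of matching rows, so a key that occurs m times in one data frame and n times in the other contributes m * n rows. A toy sketch with made-up data:

```r
left  <- data.frame(a = c(1, 1, 2), b = c("l1", "l2", "l3"))
right <- data.frame(a = c(1, 1, 1, 3), c = c("r1", "r2", "r3", "r4"))

m <- merge(left, right, by = "a", all = TRUE)

# a == 1 appears 2 times in left and 3 times in right: 2 * 3 = 6 rows,
# plus one unmatched row each for a == 2 and a == 3
nrow(m)
```

Here 3 + 4 = 7 input rows produce 8 merged rows, and the effect grows quickly when keys repeat many times.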
With tidyverse, we can do bind_rows and distinct
library(dplyr)
bind_rows(df1, df2) %>%
distinct()
data
df1 <- structure(list(a = c(1, 4, 9), b = c(2, 3, 7), c = c(3, 3, 3),
d = c(4, 4, 4)), class = "data.frame", row.names = c(NA,
-3L))
df2 <- structure(list(a = c(1, 2, 3), b = c(2, 2, 2), c = c(3, 3, 3),
d = c(4, 4, 4)), class = "data.frame", row.names = c(NA,
-3L))
This is also possible with union from dplyr:
dplyr::union(df1, df2)
Here is another base R solution using rbind + %in%:
dfout <- rbind(df1,subset(df2,!a %in% df1$a))
which gives:
> rbind(df1,subset(df2,!a %in% df1$a))
a b c d
1 1 2 3 4
2 4 3 3 4
3 9 7 3 4
21 2 2 3 4
31 3 2 3 4

Removing row with duplicated values in all columns of a data frame (R)

With the following data frame:
d <- structure(list(n = c(2, 3, 5), s = c(2, 8, 3),t = c(2, 18, 30)), .Names = c("n", "s","t"), row.names = c(NA, -3L), class = "data.frame")
which looks like:
> d
n s t
1 2 2 2
2 3 8 18
3 5 3 30
How can I remove the rows in which all columns have the same value?
Yielding:
n s t
2 3 8 18
3 5 3 30
Here's one possible approach, which compares all columns to the first:
d[rowSums(d == d[,1]) != ncol(d),]
# n s t
# 2 3 8 18
# 3 5 3 30
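The same check can also be spelled out row by row, which may be easier to read. This is a sketch (the name keep is illustrative); it assumes there are no NA values in d, and apply converts each row to a common type before comparing:

```r
d <- data.frame(n = c(2, 3, 5), s = c(2, 8, 3), t = c(2, 18, 30))

# TRUE for rows that contain more than one distinct value
keep <- apply(d, 1, function(row) length(unique(row)) > 1)

d[keep, ]
```

For large data frames the vectorised rowSums comparison is likely to be faster, since apply loops over the rows one at a time.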

Resources