melt() is using all column names as id variables - r

So, with this dummy dataset:
test_species <- c("a", "b", "c", "d", "e")
test_abundance <- c(4, 7, 15, 2, 9)
df <- rbind(test_species, test_abundance)
df <- as.data.frame(df)
colnames(df) <- c("a", "b", "c", "d", "e")
df <- dplyr::slice(df, 2)
we get a dataframe that's something like this:
a b c d e
4 7 15 2 9
I'd like to transform it into something like
species abundance
a 4
b 7
c 15
d 2
e 9
using the reshape2 function melt(). I tried the code
melted_df <- melt(df,
variable.name = "species",
value.name = "abundance")
but that tells me: "Using a, b, c, d, e as id variables", and the end result looks like this:
a b c d e
4 7 15 2 9
What am I doing wrong, and how can I fix it?

You can define it in the correct shape from the start, using only base library functions:
> data.frame(species=test_species, abundance=test_abundance)
species abundance
1 a 4
2 b 7
3 c 15
4 d 2
5 e 9

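rbind() is the source of the odd behaviour: it combines the character and numeric vectors into a single character matrix, so every column of df ends up non-numeric, and melt() treats non-numeric columns as id variables by default.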
A fairly basic fix is:
test_species <-c("a", "b", "c", "d", "e")
test_abundance <- c(4, 7, 15, 2, 9)
df <- data.frame(test_species, test_abundance) #skip rbind and go straight to df
colnames(df) <- c('species', 'abundance') #colnames correct
This skips the rbind function and gives the desired outcome.
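If the wide one-row layout from the question has to stay, a minimal sketch of the underlying issue and one possible workaround follows. It assumes the problem is the character coercion done by rbind(); the names m, wide and long are made up for illustration, and the melt() call only runs if reshape2 is installed.

```r
test_species <- c("a", "b", "c", "d", "e")
test_abundance <- c(4, 7, 15, 2, 9)

# rbind() on a character and a numeric vector yields a character matrix
m <- rbind(test_species, test_abundance)
is.character(m)

# Rebuild the one-row data frame from the question, then convert to numeric
wide <- as.data.frame(m, stringsAsFactors = FALSE)
colnames(wide) <- test_species
wide <- wide[2, , drop = FALSE]
wide[] <- lapply(wide, as.numeric)

# Naming the measure variables explicitly leaves nothing to be guessed as an id
if (requireNamespace("reshape2", quietly = TRUE)) {
  long <- reshape2::melt(wide, measure.vars = names(wide),
                         variable.name = "species", value.name = "abundance")
  print(long)
}
```

Passing measure.vars explicitly is a design choice here: it avoids relying on melt() guessing the id columns from the column types.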

Related

Giving IDs to groups in R [duplicate]

This question already has answers here:
How to create a consecutive group number
(13 answers)
Closed 1 year ago.
I have a data frame in R that is divided into groups, like this:
Row Group
1 A
2 B
3 A
4 D
5 C
6 B
7 C
8 C
9 A
10 B
I would like to add a unique numeric ID to each group, so that I end up with something like this:
Row Group ID
1 A 1
2 B 2
3 A 1
4 D 4
5 C 3
6 B 2
7 C 3
8 C 3
9 A 1
10 B 2
How could I achieve this?
Thank you very much.
Here is a simple way.
df1$ID <- as.integer(factor(df1$Group))
There are three solutions posted: mine, TarJae's and akrun's. They can be timed with increasing data sizes; akrun's is the fastest.
library(microbenchmark)
library(dplyr)
library(ggplot2)
funtest <- function(x, n){
out <- lapply(seq_len(n), function(i){
for(j in seq_len(i)) x <- rbind(x, x)
cat("nrow(x):", nrow(x), "\n")
mb <- microbenchmark(
match = with(x, match(Group, sort(unique(Group)))),
dplyr = x %>% group_by(Group) %>% mutate(ID = cur_group_id()),
intfac = as.integer(factor(x$Group))
)
mb$n <- i
mb
})
out <- do.call(rbind, out)
aggregate(time ~ ., out, median)
}
df1 %>%
funtest(10) %>%
ggplot(aes(n, time, colour = expr)) +
geom_line() +
geom_point() +
scale_x_continuous(breaks = 1:10, labels = 1:10) +
scale_y_continuous(trans = "log10") +
theme_bw()
Update
group_indices() was deprecated in dplyr 1.0.0.
Please use cur_group_id() instead.
df1 <- df %>%
group_by(Group) %>%
mutate(ID = cur_group_id())
First answer:
You can use group_indices
library(dplyr)
df1 <- df %>%
group_by(Group) %>%
mutate(ID = group_indices())
data
df <- tribble(
~Row, ~Group,
1, "A",
2, "B",
3, "A",
4, "D",
5, "C",
6, "B",
7, "C",
8, "C",
9, "A",
10,"B")
Row Group ID
<int> <chr> <int>
1 1 A 1
2 2 B 2
3 3 A 1
4 4 D 4
5 5 C 3
6 6 B 2
7 7 C 3
8 8 C 3
9 9 A 1
10 10 B 2
We can match 'Group' against its sorted unique values to get each group's position index:
df1$ID <- with(df1, match(Group, sort(unique(Group))))
data
df1 <- structure(list(Row = 1:10, Group = c("A", "B", "A", "D", "C",
"B", "C", "C", "A", "B")), class = "data.frame", row.names = c(NA,
-10L))
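As a quick check, the two base R approaches from the answers above (factor codes and match() against the sorted unique values) can be compared on the question's data. This is only a sketch; id_factor, id_match and id_first are illustrative names, and the last line shows an assumed variant that numbers groups by first appearance instead of alphabetically.

```r
df1 <- data.frame(
  Row = 1:10,
  Group = c("A", "B", "A", "D", "C", "B", "C", "C", "A", "B"),
  stringsAsFactors = FALSE
)

# IDs from factor codes (levels are the sorted unique values)
id_factor <- as.integer(factor(df1$Group))

# IDs from the position of each value in the sorted unique values
id_match <- match(df1$Group, sort(unique(df1$Group)))

# Both approaches should agree on this data
identical(id_factor, id_match)

# Variant: number the groups in order of first appearance instead
id_first <- match(df1$Group, unique(df1$Group))
```

With the sorted approaches D gets ID 4 and C gets ID 3, which matches the desired output in the question; the first-appearance variant would swap those two.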

Subset a data frame based on count of values of column x. Want only the top two in R

Here is the data frame:
p <- c(1, 3, 45, 1, 1, 54, 6, 6, 2)
x <- c("a", "b", "a", "a", "b", "c", "a", "b", "b")
df <- data.frame(p, x)
I want to subset the data frame so that I get a new data frame containing only the two most frequent values of "x", based on the count of "x".
One of the simplest ways to achieve what you want to do is with the package data.table. You can read more about it here. Basically, it allows for fast and easy aggregation of your data.
Please note that I modified your initial data by appending the elements 10 and c to p and x, respectively. This way, you won't see an NA when filtering the top two observations.
The idea is to sort your dataset and then use .SD, a special data.table symbol that holds the subset of the data for each group and is a convenient way of subsetting/filtering/extracting observations.
Please, see the code below.
library(data.table)
p <- c(1, 3, 45, 1, 1, 54, 6, 6, 2, 10)
x <- c("a", "b", "a", "a", "b", "c", "a", "b", "b", "c")
df <- data.table(p, x)
# Sort by the group x and then by p in descending order
setorder( df, x, -p )
# Extract the first two rows by group "x"
top_two <- df[ , .SD[ 1:2 ], by = x ]
top_two
#> x p
#> 1: a 45
#> 2: a 6
#> 3: b 6
#> 4: b 3
#> 5: c 54
#> 6: c 10
Created on 2021-02-16 by the reprex package (v1.0.0)
Does this work for you?
Using dplyr:
library(dplyr)
df %>%
add_count(x) %>%
slice_max(n, n = 2)
p x n
1 1 a 4
2 3 b 4
3 45 a 4
4 1 a 4
5 1 b 4
6 6 a 4
7 6 b 4
8 2 b 4
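For comparison, here is a base R sketch that follows the wording of the question literally: count how often each value of x occurs, keep the two most frequent values, and subset on them. The names counts, top_x and top_df are made up for illustration, and ties for second place are simply broken by whatever order sort() returns.

```r
p <- c(1, 3, 45, 1, 1, 54, 6, 6, 2)
x <- c("a", "b", "a", "a", "b", "c", "a", "b", "b")
df <- data.frame(p, x, stringsAsFactors = FALSE)

# Count the occurrences of each x and sort from most to least frequent
counts <- sort(table(df$x), decreasing = TRUE)

# Names of the (at most) two most frequent values
top_x <- names(counts)[seq_len(min(2, length(counts)))]

# Keep only the rows whose x is one of those values
top_df <- df[df$x %in% top_x, ]
top_df
```

On this data "a" and "b" each occur four times and "c" once, so the result keeps the eight rows for "a" and "b".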

Find out the row with different value with in same name [duplicate]

This question already has answers here:
How to remove rows that have only 1 combination for a given ID
(4 answers)
Selecting & grouping dual-category data from a data frame
(4 answers)
Closed 5 years ago.
I have a df that looks like this:
df <- data.frame(Name = c("A", "A","A","B", "B", "C", "D", "E", "E"),
Value = c(1, 1, 1, 2, 15, 3, 4, 5, 5))
Basically, A is 1, B is 2, C is 3 and so on.
However, as you can see, B has "2" and "15". "15" is the wrong value and should not be there.
I would like to find the rows whose Value does not match the others within the same Name.
The ideal output would look like this:
B 2
B 15
You can use tidyverse functions like:
library(dplyr)
df %>%
  group_by(Name, Value) %>%
  unique()
giving:
Name Value
1 A 1
2 B 2
3 B 15
4 C 3
5 D 4
6 E 5
Then, to keep only the Names with multiple Values, extend the pipeline above:
df %>%
  group_by(Name, Value) %>%
  unique() %>%
  group_by(Name) %>%
  filter(n() > 1)
Something like this? This searches for Names that are associated with more than one Value and outputs one copy of each {Name, Value} pair.
df <- data.frame(Name = c("A", "A","A","B", "B", "C", "D", "E", "E"),
Value = c(1, 1, 1, 2, 15, 3, 4, 5, 5))
res <- do.call(rbind, lapply(unique(df$Name), (function(i){
if (length(unique(df[df$Name == i,]$Value)) > 1 ) {
out <- df[df$Name == i,]
out[!duplicated(out$Value), ]
}
})))
res
Result as expected
Name Value
4 B 2
5 B 15
Filter(function(x)nrow(unique(x))!=1,split(df,df$Name))
$B
Name Value
4 B 2
5 B 15
Or:
Reduce(rbind,by(df,df$Name,function(x) if(nrow(unique(x))>1) x))
Name Value
4 B 2
5 B 15
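The same idea can be sketched in base R without split() or by(): count the distinct Values per Name, then keep the unique rows for Names with more than one. The names n_values, inconsistent and res are illustrative only.

```r
df <- data.frame(
  Name = c("A", "A", "A", "B", "B", "C", "D", "E", "E"),
  Value = c(1, 1, 1, 2, 15, 3, 4, 5, 5),
  stringsAsFactors = FALSE
)

# Number of distinct Values seen for each Name
n_values <- tapply(df$Value, df$Name, function(v) length(unique(v)))

# Names that map to more than one Value
inconsistent <- names(n_values)[n_values > 1]

# One copy of each (Name, Value) pair for those Names
res <- unique(df[df$Name %in% inconsistent, ])
res
```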

Matching data frames

I have this code:
df1 <- data.frame(letter=c("a", "a", "a", "b", "b", "c", "d", "e"),
value=c(NA))
df2 <- data.frame(letter=c("a", "b", "g", "f", "d", "e", "a", "b", "a", "c"),
number=c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
I want to match these two data frames by letter and return the corresponding number in df2 into the value column in df1.
So the result for df1 would look like this:
"I want to match these two data frames by letter"
This can be accomplished with merge and unique:
> df = unique(merge(df1['letter'], df2))
> df
letter number
1 a 1
2 a 9
3 a 7
10 b 2
11 b 8
14 c 10
15 d 5
16 e 6
This is almost it, but the number values for each letter might not be sorted as you want, so let's sort them
"and return the corresponding number in df2 into the Value column in df1."
> df1$value = df$number[order(df$letter, df$number)]
> df1
letter value
1 a 1
2 a 7
3 a 9
4 b 2
5 b 8
6 c 10
7 d 5
8 e 6
Look at the join functions in the dplyr package:
library(dplyr)
inner_join(df1, df2) %>%
distinct(letter, number)
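To see why the merge() answer above needs unique(), it can help to look at the row counts at each step. The sketch below uses only base R; merged and deduped are illustrative names. merge() pairs every "a" row of df1 with every "a" row of df2, so repeated letters multiply.

```r
df1 <- data.frame(letter = c("a", "a", "a", "b", "b", "c", "d", "e"),
                  value = NA, stringsAsFactors = FALSE)
df2 <- data.frame(letter = c("a", "b", "g", "f", "d", "e", "a", "b", "a", "c"),
                  number = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
                  stringsAsFactors = FALSE)

# Joining on "letter" gives every df1/df2 pairing per letter:
# 3 x 3 rows for "a", 2 x 2 for "b", and 1 each for "c", "d", "e"
merged <- merge(df1["letter"], df2)
nrow(merged)

# unique() collapses the repeated (letter, number) pairs
deduped <- unique(merged)
nrow(deduped)
```

Note that the deduplicated result only lines up with df1 row for row because, in this data, each letter of df1 occurs exactly as often in df2; with other data the assignment into df1$value would need a different strategy.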

Drop last 5 columns from a dataframe without knowing specific number

I have a dataframe that is created by a for-loop with a changing number of columns.
In a different function I want to drop the last five columns.
The variable holding the length of the dataframe is "units", and it takes values between 10 and 150.
I have tried using the names of the columns to drop, but it is not working. (As soon as I try to open "newframe", RStudio crashes; viewing myframe is no problem.)
drops <- c("name1","name2","name3","name4","name5")
newframe <- results[,!(names(myframe) %in% drops)]
Is there any way to just drop the last five columns of a dataframe without relying on the names or the number of columns?
length(df) can also be used:
mydf[1:(length(mydf)-5)]
You could use the counts of columns (ncol()):
df <- data.frame(x = rnorm(10), y = rnorm(10), z = rnorm(10), ws = rnorm(10))
# rm last 2 columns
df[ , -((ncol(df) - 1):ncol(df))]
# or
df[ , -seq(ncol(df)-1, ncol(df))]
You can take advantage of the list method for head() (which drops whole list elements, and works differently from the data.frame method, which drops rows):
# data.frame with 26 columns (named a-z):
df <- setNames( as.data.frame( as.list(1:26)) , letters )
# drop last 5 'columns':
as.data.frame( head(as.list(df),-5) )
# a b c d e f g h i j k l m n o p q r s t u
#1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
My preferred method is using rev, which makes the syntax cleaner. For the mtcars data set:
mtcars[-rev(seq_len(ncol(mtcars)))[1:5]]
Or using head (similar to Simon's suggestion):
mtcars[head(seq_len(ncol(mtcars)), -5)]
A tidyverse option is to use last_col(): we select the range from the fifth-from-last column (i.e., last_col(offset = 4)) through the last column, then use - to remove the selected columns.
library(tidyverse)
df %>%
select(-(last_col(offset = 4):last_col()))
Output
x y z
1 1 10 5
2 2 9 5
3 3 8 5
4 4 7 5
5 5 6 5
6 6 5 5
7 7 4 5
8 8 3 5
9 9 2 5
10 10 1 5
Another option is to use ncol in the select:
df %>%
select(-((ncol(.) - 4):ncol(.)))
Or we could use tail with names:
df %>%
select(-tail(names(.), 5))
Data
df <- structure(list(x = 1:10, y = 10:1, z = c(5, 5, 5, 5, 5, 5, 5,
5, 5, 5), a = 11:20, b = c("a", "b", "c", "d", "e", "f", "g",
"h", "i", "j"), c = c("t", "s", "r", "q", "p", "o", "n", "m",
"l", "k"), d = 30:39, e = 50:59), class = "data.frame", row.names = c(NA,
-10L))
If you are using data.table package for your data processing, one nice way can be
drops <- c("name1","name2","name3","name4","name5")
df[, .SD, .SDcols=!drops]
In fact, this allows you to drop any variables as you like.
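The index-based answers above can be wrapped in a small helper. This is only a sketch: drop_last_cols is a hypothetical name, not a function from any package. It uses seq_len() rather than 1:(ncol(df) - n) so that dropping zero columns or all columns behaves sensibly, and it refuses to drop more columns than exist.

```r
# Hypothetical helper: drop the last n columns of a data frame by position
drop_last_cols <- function(df, n = 5) {
  stopifnot(n >= 0, n <= ncol(df))
  # seq_len() returns integer(0) when nothing is left, unlike 1:0
  df[seq_len(ncol(df) - n)]
}

# mtcars has 11 columns, so 6 remain after dropping the last 5
kept <- drop_last_cols(mtcars, 5)
names(kept)
```

Single-bracket indexing without a comma keeps the result a data frame even when only one column remains.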
