Construct dataframe with levels [duplicate] - r

This question already has answers here:
Unique combination of all elements from two (or more) vectors
(6 answers)
Closed 5 years ago.
I just migrated from Python to R and I would like to know if there is any function in R which is similar to pandas.MultiIndex.from_product?
Example:
letters <- c('a', 'b')
numbers <- c(1, 2, 3)
df <- somefunction(letters, numbers)
df
letters numbers
1 a 1
2 a 2
3 a 3
4 b 1
5 b 2
6 b 3

Yes:
> letters <- c('a', 'b')
> numbers <- c(1, 2, 3)
> expand.grid(letters=letters, numbers=numbers)
letters numbers
1 a 1
2 b 1
3 a 2
4 b 2
5 a 3
6 b 3
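You can also use CJ from the data.table package, which is faster. Note that the result is a data.table rather than a plain data.frame, and that CJ sorts its output by default, so the rows come out in the order shown in the question: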
> library(data.table)
> CJ(letters=letters, numbers=numbers)
letters numbers
1: a 1
2: a 2
3: a 3
4: b 1
5: b 2
6: b 3
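If a plain data.frame in the row order from the question is wanted without data.table, one possible base R sketch is to pass the vectors to expand.grid in reverse order (its first argument varies fastest) and then reorder the columns. The stringsAsFactors = FALSE setting is an addition made here, to keep the letters as character strings.

```r
letters <- c('a', 'b')
numbers <- c(1, 2, 3)

# expand.grid varies its first argument fastest, so list numbers first ...
grid <- expand.grid(numbers = numbers, letters = letters,
                    stringsAsFactors = FALSE)

# ... then put the columns back in the requested order
df <- grid[, c("letters", "numbers")]
df
```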


There are different results between subset code in R [duplicate]

This question already has an answer here:
R subset with condition using %in% or ==. Which one should be used? [duplicate]
(1 answer)
Closed 2 years ago.
The results of:
BB = RB[RB$Rep %in% c("1", "3"), ] and
Bb = subset(RB, Rep == c("1", "3"))
are different.
What is the problem?
When you use ==, the comparison is done element by element, in order.
Consider this example:
df <- data.frame(a = 1:6, b = c(1:3, 3:1))
df
# a b
#1 1 1
#2 2 2
#3 3 3
#4 4 3
#5 5 2
#6 6 1
When you use:
subset(df, b == c(1, 3))
# a b
#1 1 1
#4 4 3
The 1st value of b is compared with 1 and the 2nd with 3. Because the comparison vector is shorter than b, its values are recycled: the 3rd value is again compared with 1, the 4th with 3, and so on until the end of the data frame. Hence, you get rows 1 and 4 as output here.
When you use %in%, each value of b is checked for membership in c(1, 3), so all rows where b is 1 or 3 are selected.
subset(df, b %in% c(1, 3))
# a b
#1 1 1
#3 3 3
#4 4 3
#6 6 1
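The recycling described above can be made explicit. The following is a small sketch reusing the same df; the variable names (recycled, rows_eq, and so on) are only illustrative.

```r
df <- data.frame(a = 1:6, b = c(1:3, 3:1))

# What == effectively compares b against: c(1, 3) recycled to 6 values
recycled <- rep_len(c(1, 3), nrow(df))   # 1 3 1 3 1 3

rows_eq   <- which(df$b == c(1, 3))       # position-wise comparison
rows_same <- which(df$b == recycled)      # same result, recycling spelled out
rows_in   <- which(df$b %in% c(1, 3))     # membership test
rows_or   <- which(df$b == 1 | df$b == 3) # equivalent to %in% here
```

Here rows_eq and rows_same both pick rows 1 and 4, while rows_in and rows_or pick rows 1, 3, 4 and 6.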

Sort data.frame or data.table using vector of column names [duplicate]

This question already has answers here:
Sort a data.table fast by Ascending/Descending order
(2 answers)
Order data.table by a character vector of column names
(2 answers)
Sort a data.table programmatically using character vector of multiple column names
(1 answer)
Closed 2 years ago.
I have a data.frame (a data.table in fact) that I need to sort by multiple columns. The names of columns to sort by are in a vector. How can I do it? E.g.
DF <- data.frame(A= 5:1, B= 11:15, C= c(3, 3, 2, 2, 1))
DF
A B C
5 11 3
4 12 3
3 13 2
2 14 2
1 15 1
sortby <- c('C', 'A')
DF[order(sortby),] ## How to do this?
The desired output is the following but using the sortby vector as input.
DF[with(DF, order(C, A)),]
A B C
1 15 1
2 14 2
3 13 2
4 12 3
5 11 3
(Solutions for data.table are preferable.)
EDIT: I'd rather avoid importing additional packages provided that base R or data.table don't require too much coding.
With data.table:
setorderv(DF, sortby)
which gives:
> DF
A B C
1: 1 15 1
2: 2 14 2
3: 3 13 2
4: 4 12 3
5: 5 11 3
For completeness, with setorder:
setorder(DF, C, A)
The advantage of using setorder/setorderv is that the data is reordered by reference, which is very fast and memory efficient. Both functions work on data.tables as well as on data.frames.
If you want to combine ascending and descending ordering, you can use the order parameter of setorderv:
setorderv(DF, sortby, order = c(1L, -1L))
which subsequently gives:
> DF
A B C
1: 1 15 1
2: 3 13 2
3: 2 14 2
4: 5 11 3
5: 4 12 3
With setorder you can achieve the same with:
setorder(DF, C, -A)
Using dplyr, you can use arrange_at, which accepts column names as strings:
library(dplyr)
DF %>% arrange_at(sortby)
# A B C
#1 1 15 1
#2 2 14 2
#3 3 13 2
#4 4 12 3
#5 5 11 3
Or, in newer versions of dplyr (1.0.0 and later), with across:
DF %>% arrange(across(all_of(sortby)))
In base R, we can use
DF[do.call(order, DF[sortby]), ]
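The base R call above sorts every column in ascending order. As a rough sketch, mixed ascending/descending order can also be driven by vectors, assuming a hypothetical logical vector descending with one flag per entry of sortby. A vector-valued decreasing argument is only supported by order with method = "radix".

```r
DF <- data.frame(A = 5:1, B = 11:15, C = c(3, 3, 2, 2, 1))
sortby <- c('C', 'A')
descending <- c(FALSE, TRUE)  # hypothetical: one flag per name in sortby

# Build the argument list for order(): the sort columns (unnamed), plus
# a per-column decreasing vector, which requires method = "radix"
args <- c(unname(as.list(DF[sortby])),
          list(decreasing = descending, method = "radix"))
sorted <- DF[do.call(order, args), ]
sorted
```

This sorts by C ascending and, within ties, by A descending.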
Also possible with dplyr, although get() looks up a single name, so each sort column needs its own call:
DF %>%
arrange(get(sortby[1]), get(sortby[2]))
But Ronak's answer is more elegant.

count unique combinations of variable values in an R dataframe column [duplicate]

This question already has answers here:
Collapse / concatenate / aggregate a column to a single comma separated string within each group
(6 answers)
Count number of rows within each group
(17 answers)
Closed 2 years ago.
I want to count the unique combinations of a variable that appear per group.
For example:
df <- data.frame(id = c(1,1,1,2,2,2,3,3,4,4,4,5,6,6,7,7,7),
status = c("a","b","c","a","b","c","b","c","b","c","d","b","b","c","b","c", "d"))
> df
id status
1 1 a
2 1 b
3 1 c
4 2 a
5 2 b
6 2 c
7 3 b
8 3 c
9 4 b
10 4 c
11 4 d
12 5 b
13 6 b
14 6 c
15 7 b
16 7 c
17 7 d
So that, for example, I can tally how many times a given combination of "status" appears.
By hand, for example, I see that "a,b,c" appears twice total (id's 1 and 2).
These seem to be similar questions, but I couldn't work out how to apply them in R and would like a clearer explanation:
Counting unique combinations
Count of unique combinations despite order
The result I think I am looking for would be something like:
abc 2
bc 3
b 1
...
An option with tidyverse: group by 'id', paste the 'status' values together, and count the resulting strings.
library(dplyr)
library(stringr)
df %>%
group_by(id) %>%
summarise(status = str_c(status, collapse="")) %>%
count(status)
# A tibble: 4 x 2
# status n
# <chr> <int>
#1 abc 2
#2 b 1
#3 bc 2
#4 bcd 2
Here is a base R option via aggregate
> aggregate(.~status,rev(aggregate(.~id,df,paste0,collapse = "")),length)
status id
1 abc 2
2 b 1
3 bc 2
4 bcd 2
You can also get there with the apply family of functions, using tapply and lapply followed by table.
tap <- tapply(df$status, df$id ,FUN= function(x) unique(x))
lap <- lapply(tap,FUN = function(x) paste0(x,collapse=""))
status <- unlist(lap)
df1 <- data.frame(table(status))
> df1
status Freq
1 abc 2
2 b 1
3 bc 2
4 bcd 2
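The answers above paste the statuses in the order they appear within each id. If the same set of statuses could appear in a different order for different ids, a possible variation (a sketch, not taken from the answers) is to sort the values before pasting, so that "c, b" and "b, c" count as the same combination. The names combos and combo_counts are illustrative.

```r
df <- data.frame(id = c(1,1,1,2,2,2,3,3,4,4,4,5,6,6,7,7,7),
                 status = c("a","b","c","a","b","c","b","c","b","c","d",
                            "b","b","c","b","c","d"))

# One sorted, de-duplicated status string per id
combos <- tapply(as.character(df$status), df$id,
                 function(x) paste(sort(unique(x)), collapse = ""))

# Tally how often each combination occurs
combo_counts <- table(combos)
combo_counts
```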

How do I only output the rows that have column values from a vector in R? [duplicate]

This question already has answers here:
Select rows from a data frame based on values in a vector
(3 answers)
Closed 7 years ago.
I have a data.frame and a vector. I want to output only the rows from the data frame that have values in a column in common with the vector v.
For example:
v = c(1,2,3,4,5)
df =
A B
1 a 2
2 b 6
3 c 4
4 d 1
5 e 8
What I want to do is output a row if df$B contains any of the values in v. Basically, if df$B[i] isn't in v, then remove row i, for i in 1:nrow(df).
output should be
A B
1 a 2
2 c 4
3 d 1
since 2,4 and 1 are in v.
You should make use of the %in% operator.
v <- c(1, 2, 3, 4, 5)
df <- read.table(text =
" A B
1 a 2
2 b 6
3 c 4
4 d 1
5 e 8", header = TRUE)
out <- df[df$B %in% v, ]
This gives:
A B
1 a 2
3 c 4
4 d 1
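The filtered result keeps the original row names (1, 3, 4), whereas the output requested in the question is numbered 1 to 3. If that matters, one way to renumber is to reset the row names, sketched here with the data frame built directly instead of via read.table:

```r
v <- c(1, 2, 3, 4, 5)
df <- data.frame(A = c("a", "b", "c", "d", "e"),
                 B = c(2, 6, 4, 1, 8),
                 stringsAsFactors = FALSE)

out <- df[df$B %in% v, ]  # keep rows whose B value occurs in v
rownames(out) <- NULL     # renumber the remaining rows from 1
out
```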

split a dataframe with numbers separated by the add sign '+' into new rows [duplicate]

This question already has answers here:
Split comma-separated strings in a column into separate rows
(6 answers)
Closed 6 years ago.
Sorry for the naive question but I have a dataframe like this:
n sp cap
1 1 a 3
2 2 b 3+2+4
3 3 c 2
4 4 d 1+5
I need to split the numbers separated by the add sign ("+") into new rows in order to the get a new dataframe like this below:
n sp cap
1 1 a 3
2 2 b 3
3 2 b 2
4 2 b 4
5 3 c 2
6 4 d 1
7 4 d 5
How can I do that? strsplit?
thanks in advance
We could use cSplit from splitstackshape
library(splitstackshape)
cSplit(df1, 'cap', sep="+", 'long')
# n sp cap
#1: 1 a 3
#2: 2 b 3
#3: 2 b 2
#4: 2 b 4
#5: 3 c 2
#6: 4 d 1
#7: 4 d 5
Or we could do this in base R. Use strsplit to split the elements of the "cap" column into substrings, which returns a list (lst). Replicate the rows of the dataset by the length of each list element, subset the dataset based on the new index, convert the "lst" elements to numeric, unlist, and cbind with the modified dataset.
lst <- strsplit(as.character(df1$cap), "[+]")
df2 <- cbind(df1[rep(1:nrow(df1), sapply(lst, length)),1:2],
cap= unlist(lapply(lst, as.numeric)))
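For a self-contained run of the base R approach, df1 has to exist first. The sketch below reconstructs it from the table in the question (assuming cap is stored as character) and uses lengths() and fixed = TRUE as minor variations on the code above.

```r
# Reconstruction of the question's data; cap assumed to be character
df1 <- data.frame(n = 1:4,
                  sp = c("a", "b", "c", "d"),
                  cap = c("3", "3+2+4", "2", "1+5"),
                  stringsAsFactors = FALSE)

# Split each cap string on a literal "+"
lst <- strsplit(df1$cap, "+", fixed = TRUE)

# Repeat each row once per piece, then attach the pieces as numbers
idx <- rep(seq_len(nrow(df1)), lengths(lst))
df2 <- cbind(df1[idx, c("n", "sp")], cap = as.numeric(unlist(lst)))
rownames(df2) <- NULL  # renumber rows from 1
df2
```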
