I have a grouped data set that looks like this:
data = data.frame(group = c(1,1,1,1,2,2,2,2),
c1 = c("A", "E", "A", "J", "L", "M", "L", "J"),
c2 = c("B", "F", "F", "K", "B", "F", "T", "E"),
c3 = c("C", "G", "C", "L", "C", "X", "C", "V"),
c4 = c("D", "H", "I", "M", "D", "T", "I", "W"))
I need to count, for each row, the values that are not duplicated anywhere within that row's group. The result should look like this:
group c1 c2 c3 c4 uniq.vals
1 1 A B C D 2
2 1 E F G H 3
3 1 A F C I 1
4 1 J K L M 4
5 2 L B C D 2
6 2 M F X T 3
7 2 L T C I 1
8 2 J E V W 4
The count for row 1 would be 2, because B and D do not show up in any of the other rows within group 1.
I am familiar with using group_by and summarize but I am having trouble extending that to this particular situation, which requires that each value be checked across multiple columns and rows. For example, n_distinct on its own would not work because I'm looking for non-duplicated values, not unique values.
Ideally the solution would also ignore NAs and not count them as duplicated or non-duplicated values.
Here is an option with tidyverse. Reshape to 'long' format with pivot_longer, group by 'group', and replace every 'value' that is duplicated within its group with NA. Then group by row number, summarise with n_distinct(value, na.rm = TRUE) to count the values that remain, and bind the result to the original data.
library(dplyr)
library(tidyr)
data %>%
mutate(rn = row_number()) %>%
pivot_longer(cols = starts_with('c')) %>%
group_by(group) %>%
mutate(value = replace(value, duplicated(value)|duplicated(value,
fromLast = TRUE), NA)) %>%
group_by(rn) %>%
summarise(uniq.vals = n_distinct(value, na.rm = TRUE), .groups = 'drop') %>%
select(uniq.vals) %>%
bind_cols(data, .)
-output
# group c1 c2 c3 c4 uniq.vals
#1 1 A B C D 2
#2 1 E F G H 3
#3 1 A F C I 1
#4 1 J K L M 4
#5 2 L B C D 2
#6 2 M F X T 3
#7 2 L T C I 1
#8 2 J E V W 4
In base R you would do:
a <- tapply(unlist(data[-1]), data$group[row(data[-1])],table)
data$uniq.vals <- c(by(data, seq(nrow(data)),
function(x)sum(a[[x[,1]]][unlist(x[-1])]<2)))
group c1 c2 c3 c4 uniq.vals
1 1 A B C D 2
2 1 E F G H 3
3 1 A F C I 1
4 1 J K L M 4
5 2 L B C D 2
6 2 M F X T 3
7 2 L T C I 1
8 2 J E V W 4
Note that in your case, row 3 should have 1, since I is the only value that is not duplicated within the group.
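For the NA requirement mentioned in the question, a minimal base R sketch may help. The helper name count_singletons and its group_col argument are made up for illustration. It assumes that every column other than the grouping column holds values to compare and that the grouping column itself contains no NAs.

```r
data <- data.frame(group = c(1, 1, 1, 1, 2, 2, 2, 2),
                   c1 = c("A", "E", "A", "J", "L", "M", "L", "J"),
                   c2 = c("B", "F", "F", "K", "B", "F", "T", "E"),
                   c3 = c("C", "G", "C", "L", "C", "X", "C", "V"),
                   c4 = c("D", "H", "I", "M", "D", "T", "I", "W"))

# Hypothetical helper: for each row, count the values that occur
# exactly once within that row's group (NA cells are never counted).
count_singletons <- function(df, group_col = "group") {
  value_cols <- setdiff(names(df), group_col)
  m <- as.matrix(df[value_cols])
  g <- df[[group_col]]
  out <- numeric(nrow(df))
  for (grp in unique(g)) {
    idx <- which(g == grp)
    block <- m[idx, , drop = FALSE]
    freq <- table(block)                 # table() drops NA by default
    once <- names(freq)[freq == 1]       # values seen exactly once in the group
    hits <- matrix(block %in% once, nrow = length(idx))
    out[idx] <- rowSums(hits)
  }
  out
}

data$uniq.vals <- count_singletons(data)
```

Because table() ignores NA and NA never matches anything in `once`, a missing cell is treated as neither duplicated nor non-duplicated.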
I have this dataframe
a <- c("a", "f", "n", "c", "d")
b <- c("L", "S", "N", "R", "S")
df <- data.frame(a,b)
a b
1 a L
2 f S
3 n N
4 c R
5 d S
I want to order the rows by column b, with the rows where b is "S" placed first and the remaining rows sorted alphabetically:
a b
2 f S
5 d S
1 a L
3 n N
4 c R
You can replace the "S" with a space while ordering, since a space sorts before any letter.
df[order(sub("S", " ", df$b)), ]
#df[order(chartr("S", " ", df$b)), ] #Alternative
# a b
#2 f S
#5 d S
#1 a L
#3 n N
#4 c R
Here is one option using factor.
df[order(factor(df$b, unique(c('S', sort(df$b))))), ]
# a b
#2 f S
#5 d S
#1 a L
#3 n N
#4 c R
Using dplyr
library(dplyr)
df %>%
arrange(b != 'S', b)
a b
1 f S
2 d S
3 a L
4 n N
5 c R
Or in base R
df[order(df$b != "S", df$b),]
a b
2 f S
5 d S
1 a L
3 n N
4 c R
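The logical-key idea works because FALSE sorts before TRUE, so rows where the test is FALSE come first. A minimal sketch of the same idea for more than one priority value follows; the name priority is an assumption made for illustration, not something from the question.

```r
df <- data.frame(a = c("a", "f", "n", "c", "d"),
                 b = c("L", "S", "N", "R", "S"))

# Hypothetical set of values that should come first
priority <- c("S")

# First key: FALSE (priority rows) sorts before TRUE (everything else);
# second key: alphabetical order of b within each block
ordered <- df[order(!(df$b %in% priority), df$b), ]
ordered
```

With priority <- c("S", "R"), the "R" and "S" rows would both move to the top, sorted alphabetically among themselves.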
For example, I have a dataset that looks like this:
structure(list(ID = c(1, 2, 3, 4, 5), COL1 = c("A", "B", "C",
"D", "E"), COL2 = c("F", "G", "H", "I", "J"), Paired = c(2, 3,
1, 2, 1)), row.names = c(NA, -5L), class = c("tbl_df", "tbl",
"data.frame"))
ID COL1 COL2 Paired
1 A F 2
2 B G 3
3 C H 1
4 D I 2
5 E J 1
I would like to create a dataset that looks like this. Note that the number of repetitions comes from the Paired column:
Col Col2
A F
A F
F A
F A
B G
B G
B G
G B
G B
G B
C H
H C
D I
D I
I D
I D
E J
J E
Note that A and F are paired two times. I want the long output to show each pairing in both orders, so 2 pairs gives AF, AF, FA, FA.
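Assuming the data above is stored as df1, we can use uncount from tidyr to repeat each row Paired times, then bind a copy with COL1 and COL2 swapped: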
library(dplyr)
library(tidyr)
df1 %>%
uncount(Paired) -> tmp
tmp %>%
rename(COL1= COL2, COL2 = COL1) %>%
bind_rows(tmp) %>%
select(-ID)
-output
# A tibble: 18 x 2
COL2 COL1
<chr> <chr>
1 A F
2 A F
3 B G
4 B G
5 B G
6 C H
7 D I
8 D I
9 E J
10 F A
11 F A
12 G B
13 G B
14 G B
15 H C
16 I D
17 I D
18 J E
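The output above contains the right rows but lists all forward pairs before all reversed pairs. If the interleaved order from the question matters, one possible base R sketch is below. The helper columns pos and dir are introduced only for sorting and are not part of the original data.

```r
df1 <- data.frame(ID = 1:5,
                  COL1 = c("A", "B", "C", "D", "E"),
                  COL2 = c("F", "G", "H", "I", "J"),
                  Paired = c(2, 3, 1, 2, 1))

# Repeat each row index Paired times
idx <- rep(seq_len(nrow(df1)), times = df1$Paired)

# Forward pairs (dir = 1) and reversed pairs (dir = 2), tagged with the source row
forward <- data.frame(pos = idx, dir = 1L, Col = df1$COL1[idx], Col2 = df1$COL2[idx])
reverse <- data.frame(pos = idx, dir = 2L, Col = df1$COL2[idx], Col2 = df1$COL1[idx])

# Sort by source row, then direction, and drop the helper columns
both <- rbind(forward, reverse)
both <- both[order(both$pos, both$dir), c("Col", "Col2")]
rownames(both) <- NULL
both
```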
I have a data frame with five thousand rows. I need to create a new column with a unique identifier made of the value in column "gender", then the number 21, and then a sequential number starting at 0001. It is important that the sequential number restarts for each different letter in column "gender" (gender + 21 + seq#).
library(tibble)
df <- tibble(
name = c("A", "B", "C", "D", "E", "F", "G", "H", "I"),
gender = c("F", "F", "F", "M","M","F","M","F","F")
)
df
name gender
<chr> <chr>
1 A F
2 B F
3 C F
4 D M
5 E M
6 F F
7 G M
8 H F
9 I F
With unique identifier:
df
name gender id
1 A F F210001
2 B F F210002
3 C F F210003
4 D M M210001
5 E M M210002
6 F F F210004
7 G M M210003
8 H F F210005
9 I F F210006
Any help on how to achieve this would be much appreciated.
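One option is to paste gender onto a per-gender counter from rowid (data.table). Adding 210000 to the counter supplies both the 21 and the zero padding, which holds as long as no gender has more than 9999 rows: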
library(dplyr)
library(stringr)
library(data.table)
df1 <- df %>%
mutate(id = str_c(gender, rowid(gender) + 210000))
Or group by gender and use row_number:
df1 <- df %>%
group_by(gender) %>%
mutate(id = str_c(gender, row_number() + 210000)) %>%
ungroup()
In base R you could use ave:
transform(df, group = ave(gender, gender, FUN = function(x)sprintf("%s21%04d",x,seq(x))))
name gender group
1 A F F210001
2 B F F210002
3 C F F210003
4 D M M210001
5 E M M210002
6 F F F210004
7 G M M210003
8 H F F210005
9 I F F210006
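The two styles of padding (adding 210000 versus formatting with sprintf) agree for counters up to 9999 and diverge after that. A small base R sketch to make the difference concrete; the function name make_id is hypothetical.

```r
# Hypothetical helper: gender + "21" + zero-padded per-gender counter
make_id <- function(gender) {
  # ave() restarts seq_along() within each distinct gender value
  counter <- ave(seq_along(gender), gender, FUN = seq_along)
  sprintf("%s21%04d", gender, counter)
}

gender <- c("F", "F", "F", "M", "M", "F", "M", "F", "F")
ids <- make_id(gender)
ids

# Beyond 9999 rows per gender the two approaches differ:
by_format   <- sprintf("%s21%04d", "F", 10000L)   # keeps the literal 21
by_addition <- paste0("F", 10000L + 210000L)      # the 21 becomes 22
```

With five thousand rows in total neither limit is reached, so either approach should give the same identifiers here.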
The aim of this project is to understand how information is acquired while looking at an object. Imagine an object has elements like a, b, c, d, e and f. A person might look at a, move on to b, and so forth. We wish to plot and understand how that person has navigated across the different elements of a given stimulus. I have data that captured this movement in a single column, but I need to split it into a few columns to get the navigation pattern. Please find the example given below.
I have a column extracted from a data frame. It now has to be split into four columns based on its characteristics.
a <- c( "a", "b", "b", "b", "a", "c", "a", "b", "d", "d", "d", "e", "f", "f", "e", "e", "f")
a <- as.data.frame(a)
Expected output
from to countfrom countto
a b 1 3
b a 3 1
a c 1 1
c a 1 1
a b 1 1
b d 1 3
d e 3 1
e f 1 2
f e 2 2
e f 2 1
Note: I used dplyr to extract from the dataframe.
Use rle to get the run lengths of each letter, and then piece it together by pairing each run with the one that follows it:
r <- rle(a$a)
## or maybe `r <- rle(as.character(a$a))` if a$a is a factor (the default before R 4.0)
setNames(
data.frame(lapply(r, head, -1), lapply(r, tail, -1)),
c("countfrom","from","countto","to")
)
## countfrom from countto to
##1 1 a 3 b
##2 3 b 1 a
##3 1 a 1 c
##4 1 c 1 a
##5 1 a 1 b
##6 1 b 3 d
##7 3 d 1 e
##8 1 e 2 f
##9 2 f 2 e
##10 2 e 1 f
Or in the tidyverse
library(tidyverse)
a <- c( "a", "b", "b", "b", "a", "c", "a", "b", "d",
"d", "d", "e", "f", "f", "e", "e", "f")
foo <- rle(a)
answ <- tibble(from = foo$values, to = lead(foo$values),
fromCount = foo$lengths, toCount = lead(foo$lengths)) %>%
filter(!is.na(to))
# A tibble: 10 x 4
from to fromCount toCount
<chr> <chr> <int> <int>
1 a b 1 3
2 b a 3 1
3 a c 1 1
4 c a 1 1
5 a b 1 1
6 b d 1 3
7 d e 3 1
8 e f 1 2
9 f e 2 2
10 e f 2 1
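Both answers rely on what rle returns: one entry per run of identical values, with the run's value and its length. A short base R sketch of that intermediate step, producing the column order from the question; the name transitions is made up for illustration.

```r
a <- c("a", "b", "b", "b", "a", "c", "a", "b", "d",
       "d", "d", "e", "f", "f", "e", "e", "f")

r <- rle(a)
r$values   # one entry per run: a b a c a b d e f e f
r$lengths  # length of each run: 1 3 1 1 1 1 3 1 2 2 1

# Each transition pairs run i (from) with run i + 1 (to)
n <- length(r$values)
transitions <- data.frame(from      = r$values[-n],
                          to        = r$values[-1],
                          countfrom = r$lengths[-n],
                          countto   = r$lengths[-1])
transitions
```

Eleven runs give ten transitions, which matches the ten rows in the expected output.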
In RStudio, I have a data frame with 4 columns. I need the list of every distinct triplet of the first 3 columns, together with the sum of the 4th column for that triplet, sorted in decreasing order of that sum. For example, with:
A B C 2
D E F 5
A B C 4
G H I 5
D E F 3
I need as a result:
D E F 8
A B C 6
G H I 5
I've tried the following approaches, but I can't get exactly the result I need:
df_list<-df_raw_data %>%
group_by(param1, param2, param3) %>%
summarise_all(total = sum(param4))
arrange(df_list, desc(total))
and:
df_list<-unique(df_raw_data[, c('param1', 'param2', 'param3')])
cbind(df_list, total)
for(i in 1:nrow(df_raw_data))
{
filter ???????????
}
I would prefer to use the dplyr package since it's a more elegant solution.
EDIT: Okay, thanks for your working answers. I think that I've lost some time figuring out that the plyr package shouldn't be loaded after dplyr...
We can use group_by_at to select the columns to group.
library(dplyr)
dat2 <- dat %>%
group_by_at(vars(-V4)) %>%
summarise(V4 = sum(V4)) %>%
ungroup()
dat2
# # A tibble: 3 x 4
# V1 V2 V3 V4
# <chr> <chr> <chr> <int>
# 1 A B C 6
# 2 D E F 8
# 3 G H I 5
Or use group_by_if to select columns to group based on column types.
dat2 <- dat %>%
group_by_if(is.character) %>%
summarise(V4 = sum(V4)) %>%
ungroup()
dat2
# # A tibble: 3 x 4
# V1 V2 V3 V4
# <chr> <chr> <chr> <int>
# 1 A B C 6
# 2 D E F 8
# 3 G H I 5
DATA
dat <- read.table(text = "A B C 2
D E F 5
A B C 4
G H I 5
D E F 3",
header = FALSE, stringsAsFactors = FALSE)
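The outputs above are grouped and summed but not yet sorted by the total, which the question also asks for. As a dependency-free cross-check, here is a base R sketch using aggregate and order; the names totals and sorted are assumptions made for illustration, and the same sort could be done in dplyr by adding arrange(desc(V4)) to the pipelines above.

```r
dat <- read.table(text = "A B C 2
D E F 5
A B C 4
G H I 5
D E F 3",
                  header = FALSE, stringsAsFactors = FALSE)

# Sum V4 for every distinct (V1, V2, V3) triplet
totals <- aggregate(V4 ~ V1 + V2 + V3, data = dat, FUN = sum)

# Sort by the summed column, largest first
sorted <- totals[order(totals$V4, decreasing = TRUE), ]
rownames(sorted) <- NULL
sorted
```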
Would this be what you are looking for?
library(dplyr)
df <- tibble(var1 = c("A", "D", "A", "G", "D"),
var2 = c("B", "E", "B", "H", "E"),
var3 = c("C", "F", "C", "I", "F"),
var4 = c(2, 5, 4, 5, 3))
df %>% group_by(var1, var2, var3) %>%
summarise(sum = sum(var4)) %>%
arrange(desc(sum))