Here is my data.
dat<-read.table(text=" MP1 MP2 MP3 N1 N2 N3 WP1 WP2 WP3
A A A Y Y Y 10 11 11
A B A Y Y Y 10 11 11
B B A Y Y Y 10 10 11
A B A Y Y Y 11 11 10
B B A Y Y Y 10 10 11
B B A N Y Y 11 10 10
B C A Y Y Y 11 11 11
C C B Y Y N 10 11 10
B C B Y Y Y 11 11 11
B C B Y N Y 10 11 11
",header=TRUE)
I want to reshape it into the table below, with three columns instead of nine. The columns are named as follows:
MP N WP
A Y 10
A Y 10
B Y 10
A Y 11
B Y 10
B N 11
B Y 11
C Y 10
B Y 11
B Y 10
A Y 11
B Y 11
B Y 10
B Y 11
B Y 10
B Y 10
C Y 11
C Y 11
C Y 11
C N 11
A Y 11
A Y 11
A Y 11
A Y 10
A Y 11
A Y 10
A Y 11
B N 10
B Y 11
B Y 11
I have tried this:
dat1 <- data.frame(MP=unlist(dat, use.names = FALSE))
But I am not sure why it does not work. I also tried:
dat2 <- data.frame(MP = c(dat[,"MP"], dat[,"N"], dat[,"WP"]))
Here's another base R approach that preserves the factors:
names(dat) <- c(rep("MP", 3), rep("N", 3), rep("WP", 3))
rdat2 <- rbind(dat[, c(1, 4, 7)], dat[, c(2, 5, 8)], dat[, c(3, 6, 9)])
str(rdat2)
# 'data.frame': 30 obs. of 3 variables:
# $ MP: Factor w/ 3 levels "A","B","C": 1 1 2 1 2 2 2 3 2 2 ...
# $ N : Factor w/ 2 levels "N","Y": 2 2 2 2 2 1 2 2 2 2 ...
# $ WP: int 10 10 10 11 10 11 11 10 11 10 ...
One option is pivot_longer. Set the cols argument to everything(), since we are using all the columns. The column names split between the uppercase letters and the digits, so a regex lookaround in names_sep can split at that boundary:
library(dplyr)
library(tidyr)
dat %>%
  pivot_longer(cols = everything(), names_to = c(".value", "grp"),
               names_sep = "(?<=[A-Z])(?=[0-9])") %>%
  select(-grp)
# A tibble: 30 x 3
# MP N WP
# <fct> <fct> <int>
# 1 A Y 10
# 2 A Y 11
# 3 A Y 11
# 4 A Y 10
# 5 B Y 11
# 6 A Y 11
# 7 B Y 10
# 8 B Y 10
# 9 A Y 11
#10 A Y 11
# … with 20 more rows
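As a variation on the lookaround split, names_pattern with two capture groups may be easier to read. The sketch below uses a small made-up data frame (not the full data from the question) and also sorts by grp before dropping it, on the assumption that you want the rows stacked column by column, as in the requested output, rather than interleaved row by row:

```r
library(tidyr)

# small illustrative data set (hypothetical, not the question's full data)
dat <- read.table(text = "MP1 MP2 N1 N2 WP1 WP2
A C Y Y 10 11
B C N Y 11 10", header = TRUE)

long <- pivot_longer(
  dat,
  cols = everything(),
  names_to = c(".value", "grp"),
  # first group = column stem (MP, N, WP), second group = numeric suffix
  names_pattern = "^([A-Z]+)([0-9]+)$"
)

# pivot_longer returns rows interleaved per original row;
# ordering by grp stacks all *1 columns first, then all *2 columns
long <- long[order(long$grp), ]
long$grp <- NULL
long
```

If the interleaved order is fine for your purpose, the order() step can simply be left out.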
Or with melt from data.table
library(data.table)
melt(setDT(dat), measure = patterns("^MP", "^N", "^WP"),
value.name = c("MP", "N","WP"))[, variable := NULL][]
A quick solution using base R is:
as.data.frame(sapply(c("MP", "N", "WP"), function(x) unlist(dat[grep(x, names(dat))]), simplify = FALSE))
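One thing to watch with the grep approach is that an unanchored pattern such as "N" would also match any other column name containing an N. It works for the names in the question, but a slightly more defensive sketch anchors the pattern to the stem followed by digits. The stems vector and the reduced data set below are only illustrative:

```r
# reduced illustrative data set (first rows only)
dat <- read.table(text = "MP1 MP2 MP3 N1 N2 N3 WP1 WP2 WP3
A A A Y Y Y 10 11 11
A B A Y Y Y 10 11 11
B B A N Y Y 11 10 10", header = TRUE)

stems <- c("MP", "N", "WP")

long <- as.data.frame(
  sapply(stems, function(s) {
    # match the stem followed only by digits, e.g. "^N[0-9]+$"
    cols <- grep(paste0("^", s, "[0-9]+$"), names(dat))
    # stack the matching columns on top of each other
    unlist(dat[cols], use.names = FALSE)
  }, simplify = FALSE)
)
long
```

Because each stem is unlisted separately, the integer WP columns are not mixed with the character columns, which is what happens when unlist is applied to the whole data frame at once.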
group = c(1,1,4,4,4,5,5,6,1,4,6,1,1,1,1,6,4,4,4,4,1,4,5,6)
animal = c('a','b','c','c','d','a','b','c','b','d','c','a','a','a','a','c','c','c','c','c','a','c','a','c')
sleep = c('y','n','y','y','y','n','n','y','n','y','n','y','y','n','m','y','n','n','n','n',NA, NA, NA, NA)
test = data.frame(group, animal, sleep)
print(test)
group_animal = test %>% group_by(`group`, `animal`) %>% count(sleep)
print(group_animal)
I would like to replace the NA values in the sleep column of test with the most frequent sleep answer for the same group and animal.
For example, the NAs for group 1, animal a should become 'y', because that is the value with the highest count among group 1, animal a.
Likewise, the NAs for group 4, animal c should become 'n'.
Another option is to replace the NAs with the mode of each group. You can pass the Mode function from this post to na.aggregate from zoo, like this:
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
group = c(1,1,4,4,4,5,5,6,1,4,6,1,1,1,1,6,4,4,4,4,1,4,5,6)
animal = c('a','b','c','c','d','a','b','c','b','d','c','a','a','a','a','c','c','c','c','c','a','c','a','c')
sleep = c('y','n','y','y','y','n','n','y','n','y','n','y','y','n','m','y','n','n','n','n',NA, NA, NA, NA)
test = data.frame(group, animal, sleep)
library(dplyr)
library(zoo)
test %>%
group_by(group, animal) %>%
mutate(sleep = na.aggregate(sleep , FUN=Mode)) %>%
ungroup()
#> # A tibble: 24 × 3
#> group animal sleep
#> <dbl> <chr> <chr>
#> 1 1 a y
#> 2 1 b n
#> 3 4 c y
#> 4 4 c y
#> 5 4 d y
#> 6 5 a n
#> 7 5 b n
#> 8 6 c y
#> 9 1 b n
#> 10 4 d y
#> # … with 14 more rows
#> # ℹ Use `print(n = ...)` to see more rows
Created on 2022-07-26 by the reprex package (v2.0.1)
Here is the tail of the output:
> tail(test)
# A tibble: 6 × 3
group animal sleep
<dbl> <chr> <chr>
1 4 c n
2 4 c n
3 1 a y
4 4 c n
5 5 a n
6 6 c y
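If you would rather avoid the extra packages, the same idea can probably be sketched in base R with ave. This is only an illustration: the Mode helper here is a variant that drops NAs before counting, and the anonymous fill function is my own naming, not something from the answer above.

```r
# variant of Mode that ignores NAs when counting
Mode <- function(x) {
  x <- x[!is.na(x)]
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

group <- c(1,1,4,4,4,5,5,6,1,4,6,1,1,1,1,6,4,4,4,4,1,4,5,6)
animal <- c('a','b','c','c','d','a','b','c','b','d','c','a','a','a','a','c','c','c','c','c','a','c','a','c')
sleep <- c('y','n','y','y','y','n','n','y','n','y','n','y','y','n','m','y','n','n','n','n',NA, NA, NA, NA)
test <- data.frame(group, animal, sleep, stringsAsFactors = FALSE)

# within each group/animal combination, fill NAs with that combination's mode
test$sleep <- ave(test$sleep, test$group, test$animal,
                  FUN = function(s) {
                    s[is.na(s)] <- Mode(s)
                    s
                  })
tail(test, 4)
```

On ties, which.max picks the value that appears first in the group, so the result depends on row order in that case.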
Updated to use group_by(group, animal), thanks to @Quinten; the prior answer has been removed:
Group by group and animal.
Use replace_na with the replace argument set to sleep[n==max(n)].
New: in case of ties, as in group 5, add !is.na(sleep) to avoid conflicts:
library(dplyr)
library(tidyr)
group_animal %>%
group_by(group, animal) %>%
arrange(desc(sleep), .by_group = TRUE) %>%
mutate(sleep = replace_na(sleep, sleep[n==max(n) & !is.na(sleep)]))
group animal sleep n
<dbl> <chr> <chr> <int>
1 1 a y 3
2 1 a n 1
3 1 a m 1
4 1 a y 1
5 1 b n 2
6 4 c y 2
7 4 c n 4
8 4 c n 1
9 4 d y 2
10 5 a n 1
11 5 a n 1
12 5 b n 1
13 6 c y 2
14 6 c n 1
15 6 c y 1
Try this.
This method essentially builds a fallback value to coalesce with sleep: it picks the element of sleep with the highest count, as obtained from str_count. Note that which.max is needed here to get the position of the highest count; using max would index sleep by the count itself.
library(dplyr)
test |>
  group_by(group, animal) |>
  mutate(sleep = coalesce(sleep, sleep[which.max(stringr::str_count(paste(sleep, collapse = ""), pattern = sleep))])) |>
  ungroup()
group animal sleep
1 1 a y
2 1 b n
3 4 c y
4 4 c y
5 4 d y
6 5 a n
7 5 b n
8 6 c y
9 1 b n
10 4 d y
11 6 c n
12 1 a y
13 1 a y
14 1 a n
15 1 a m
16 6 c y
17 4 c n
18 4 c n
19 4 c n
20 4 c n
21 1 a y
22 4 c n
23 5 a n
24 6 c y
Example data:
tibbly = tibble(age = c(10,30,50,10,30,50,10,30,50,10,30,50),
grouping1 = c("A","A","A","A","A","A","B","B","B","B","B","B"),
grouping2 = c("X", "X", "X","Y","Y","Y","X","X","X","Y","Y","Y"),
value = c(1,2,3,4,4,6,2,5,3,6,3,2))
> tibbly
# A tibble: 12 x 4
age grouping1 grouping2 value
<dbl> <chr> <chr> <dbl>
1 10 A X 1
2 30 A X 2
3 50 A X 3
4 10 A Y 4
5 30 A Y 4
6 50 A Y 6
7 10 B X 2
8 30 B X 5
9 50 B X 3
10 10 B Y 6
11 30 B Y 3
12 50 B Y 2
Question:
How can I obtain the order of rows for each group in a data frame? I can use dplyr to arrange the data in an appropriate form to visualize what I am interested in:
> tibbly %>%
group_by(grouping1, grouping2) %>%
arrange(grouping1, grouping2, desc(value))
# A tibble: 12 x 4
# Groups: grouping1, grouping2 [4]
age grouping1 grouping2 value
<dbl> <chr> <chr> <dbl>
1 50 A X 3
2 30 A X 2
3 10 A X 1
4 50 A Y 6
5 10 A Y 4
6 30 A Y 4
7 30 B X 5
8 50 B X 3
9 10 B X 2
10 10 B Y 6
11 30 B Y 3
12 50 B Y 2
In the end I am interested in the order of the age column for each group, based on the value column. Is there an elegant way to do this with dplyr? Something like summarise(), but based on the order of rows and not the actual values.
library(dplyr)
tibbly = tibble(age = c(10,30,50,10,30,50,10,30,50,10,30,50),
grouping1 = c("A","A","A","A","A","A","B","B","B","B","B","B"),
grouping2 = c("X", "X", "X","Y","Y","Y","X","X","X","Y","Y","Y"),
value = c(1,2,3,4,4,6,2,5,3,6,3,2))
tibbly %>%
  group_by(grouping1, grouping2) %>%                   # for each group
  arrange(desc(value)) %>%                             # arrange value descending
  summarise(order = paste0(age, collapse = ",")) %>%   # get the order of age as a string
  ungroup()                                            # forget the grouping
# # A tibble: 4 x 3
# grouping1 grouping2 order
# <chr> <chr> <chr>
# 1 A X 50,30,10
# 2 A Y 50,10,30
# 3 B X 30,50,10
# 4 B Y 10,30,50
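For comparison, here is a rough base R sketch of the same sort-then-collapse idea, using aggregate on a plain data frame. The names df, ord and res are just for illustration. Note that toString separates values with ", " rather than ",", and that aggregate lists the groups in a different order than the dplyr result.

```r
df <- data.frame(
  age = c(10,30,50,10,30,50,10,30,50,10,30,50),
  grouping1 = c("A","A","A","A","A","A","B","B","B","B","B","B"),
  grouping2 = c("X","X","X","Y","Y","Y","X","X","X","Y","Y","Y"),
  value = c(1,2,3,4,4,6,2,5,3,6,3,2),
  stringsAsFactors = FALSE
)

# sort the whole table by value, descending (ties keep their original order)
ord <- df[order(-df$value), ]

# collapse age within each group; the within-group row order is preserved
res <- aggregate(age ~ grouping1 + grouping2, data = ord, FUN = toString)
res
```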
With data.table
library(data.table)
setDT(tibbly)[order(-value), .(order = toString(age)),.(grouping1, grouping2)]
I have a very large data.table that I want to trim down as follows:
Each id should appear only once.
If a log contains any value other than "X", that other value should stay.
If a log contains only "X", the first "X" should stay.
If a log contains more than one value other than "X", all of them should stay, separated by commas, without the "X".
Sample dataset:
library(data.table)
dt <- data.table(
id=c(1,1,2,3,3,4,4,4,5,5),
log=c(11,11,11,12,12,12,12,12,13,13),
art=c("X", "Y", "X", "X", "X", "Z", "X", "Y","X", "X")
)
dt
id log art
1: 1 11 X
2: 1 11 Y
3: 2 11 X
4: 3 12 X
5: 3 12 X
6: 4 12 Z
7: 4 12 X
8: 4 12 Y
9: 5 13 X
10: 5 13 X
Required output:
id log art
1 11 Y
2 11 Y
3 12 Z,Y
4 12 Z,Y
5 13 X
Here is one method, though there may be a more efficient approach.
unique(dt[,.(id, log)])[dt[, .(art=if(.N == 1 | all(art == "X"))
art[1] else toString(unique(art[art != "X"]))),
by=log], on="log"]
which returns
id log art
1: 1 11 Y
2: 2 11 Y
3: 3 12 Z, Y
4: 4 12 Z, Y
5: 5 13 X
This performs a left join of the desired art values for each log onto the unique pairs of id and log. It assumes that no id spans two logs, which is the case in the example.
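To make the two steps of the join easier to follow, here is the same logic split into named intermediate tables. The names art_by_log and ids are only for illustration, and the condition is simplified to all(art == "X"), which I believe gives the same result on this sample:

```r
library(data.table)

dt <- data.table(
  id  = c(1,1,2,3,3,4,4,4,5,5),
  log = c(11,11,11,12,12,12,12,12,13,13),
  art = c("X", "Y", "X", "X", "X", "Z", "X", "Y", "X", "X")
)

# step 1: one art value per log ("X" only if nothing else occurs in that log)
art_by_log <- dt[, .(art = if (all(art == "X")) art[1]
                           else toString(unique(art[art != "X"]))),
                 by = log]

# step 2: the unique id/log pairs
ids <- unique(dt[, .(id, log)])

# step 3: look up every log's art value for each id/log pair
res <- ids[art_by_log, on = "log"]
res
```

As noted above, this relies on each id belonging to exactly one log.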
We can try
dt[, .(art = if(all(art=="X")) "X" else
toString(unique(art[art != "X"]))), .(id, logbld = log)]
# id logbld art
#1: 1 11 Y
#2: 2 11 X
#3: 3 12 X
#4: 4 12 Z, Y
#5: 5 13 X
Just wanted to try this with dplyr:
library(data.table)
library(dplyr)
dat <- setDT(dt %>% group_by(id) %>%
unique() %>%
summarise(bldlog = mean(log),
art = gsub("X,|,X", "",paste(art, collapse = ","))))
dat
# id bldlog art
# 1: 1 11 Y
# 2: 2 11 X
# 3: 3 12 X
# 4: 4 12 Z,Y
# 5: 5 13 X
I have a "toy" data frame with 2 columns (x and y) and 8 rows. I would like to merge and sum all rows where y < 10. Value of merged x is not very important.
x = c("A","B","C","D","E","F","G","H")
y = c(20,17,16,14,12,9,6,5)
df = data.frame(x,y)
df
x y
1 A 20
2 B 17
3 C 16
4 D 14
5 E 12
6 F 9
7 G 6
8 H 5
Desired output:
x y
1 A 20
2 B 17
3 C 16
4 D 14
5 E 12
6 F 20
F is not necessary and can be set to Other. Thanks in advance!
I think this is what you are looking for.
x = c("A","B","C","D","E","F","G","H")
y = c(20,17,16,14,12,9,6,5)
# keep the rows with y >= 10 unchanged
df = data.frame(x = x[y >= 10], y = y[y >= 10])
# add one row holding the sum of all y < 10
df = rbind(df, data.frame(x = 'F', y = sum(y[y < 10])))
We can also try with subset/transform/rbind
rbind(subset( df, y>=10),
transform(subset(df, y<10), x= x[1], y= sum(y))[1,])
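Since the question says the merged label can be set to Other, here is a small sketch along those lines. The small variable and the "Other" label are just illustrative choices:

```r
x <- c("A","B","C","D","E","F","G","H")
y <- c(20,17,16,14,12,9,6,5)
df <- data.frame(x, y, stringsAsFactors = FALSE)

# flag the rows that should be merged
small <- df$y < 10

# keep the large rows and append one summary row for the rest
res <- rbind(df[!small, ],
             data.frame(x = "Other", y = sum(df$y[small])))
res
```

Keeping the threshold test in one logical vector makes it harder for the two halves to drift apart, for example by using > 10 in one place and < 10 in the other, which would silently drop rows where y is exactly 10.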
In the following dataset:
Day Place Name
22 X A
22 X A
22 X B
22 X A
22 Y C
22 Y C
22 Y D
23 X B
23 X A
How can I assign numbering to the variable Name in the following order using R:
Day Place Name Number
22 X A 1
22 X A 1
22 X B 2
22 X A 1
22 Y C 1
22 Y C 1
22 Y D 2
23 X B 1
23 X A 2
In a nutshell, I need to number the names according to their order of occurrence on a certain day and at a certain place.
In base R using tapply:
dat$Number <-
unlist(tapply(dat$Name,paste(dat$Day,dat$Place),
FUN=function(x){
y <- as.character(x)
as.integer(factor(y,levels=unique(y)))
}))
# Day Place Name Number
# 1 22 X A 1
# 2 22 X A 1
# 3 22 X B 2
# 4 22 Y C 1
# 5 22 Y C 1
# 6 22 Y D 2
# 7 23 X B 1
# 8 23 X A 2
The idea:
Group by Day and Place using tapply.
For each group, coerce Name to a factor whose levels keep their order of first appearance.
Coerce that factor to integer to get the final result.
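The factor trick in the steps above is equivalent to match(y, unique(y)). A sketch of the same idea with ave is shown below; unlike unlist(tapply(...)), ave writes each result back to its original row, so it should not depend on the data being sorted by Day and Place. The data frame is rebuilt here from the question so the snippet runs on its own.

```r
dat <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "Day Place Name
22 X A
22 X A
22 X B
22 X A
22 Y C
22 Y C
22 Y D
23 X B
23 X A")

# position of each name among the distinct names of its Day/Place group,
# in order of first appearance
dat$Number <- ave(seq_len(nrow(dat)), dat$Day, dat$Place,
                  FUN = function(i) match(dat$Name[i], unique(dat$Name[i])))
dat
```

Here ave is run over row indices rather than over Name itself, so the result stays an integer column.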
Using data.table (more concise syntax):
library(data.table)
setDT(dat)[,Number := {
y <- as.character(Name)
as.integer(factor(y,levels=unique(y)))
},"Day,Place"]
Day Place Name Number
1: 22 X A 1
2: 22 X A 1
3: 22 X B 2
4: 22 Y C 1
5: 22 Y C 1
6: 22 Y D 2
7: 23 X B 1
8: 23 X A 2
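Another base R option uses ave with a small run-counting helper. Note that it numbers runs of consecutive equal names, so it matches the desired output only when equal names within a group are adjacent:
idx <- function(x) cumsum(c(TRUE, tail(x, -1) != head(x, -1)))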
transform(dat, Number = ave(idx(Name), Day, Place, FUN = idx))
# Day Place Name Number
# 1 22 X A 1
# 2 22 X A 1
# 3 22 X B 2
# 4 22 Y C 1
# 5 22 Y C 1
# 6 22 Y D 2
# 7 23 X B 1
# 8 23 X A 2
Use ddply from plyr.
dfr <- read.table(header = TRUE, text = "Day Place Name
22 X A
22 X A
22 X B
22 X A
22 Y C
22 Y C
22 Y D
23 X B
23 X A")
library(plyr)
ddply(
dfr,
.(Day, Place),
mutate,
Number = as.integer(factor(Name, levels = unique(Name)))
)
Or use dplyr, in a variant of beginneR's deleted answer.
library(dplyr)
dfr %>%
group_by(Day, Place) %>%
mutate(Number = as.integer(factor(Name, levels = unique(Name))))