In R, I want to separate numbers that are in the same column. My data appear like this:
id time
1 1,2
2 3,4
3 4,5,6
I want it to appear like this:
1 1
1 2
2 3
2 4
3 4
3 5
3 6
Though not shown, there are different iterations of time that vary depending on the id. For example:
4 1,6,7
5 1,3,6
6 1,4,5
7 1,3,5
8 2,3,4
There are 100 ids, and the time column holds different numbers in varying order, as shown above.
Does anyone have advice on how to do this?
An option with separate_rows, assuming time holds the digits with no separator between them (as in the data below):
library(dplyr)
library(tidyr)
df %>%
separate_rows(time, sep = "(?<=.)(?=.)", convert = TRUE)
# A tibble: 4 x 2
# id time
# <dbl> <int>
#1 1 1
#2 1 2
#3 2 3
#4 2 4
data
df <- structure(list(id = c(1, 2), time = c(12, 34)), class = "data.frame",
row.names = c(NA,
-2L))
Using tidyverse you could try the following. Make sure time is character type, and use strsplit to split it on the commas.
library(tidyverse)
df %>%
mutate(time = strsplit(as.character(time), ",")) %>%
unnest(cols = time)
Or you can just use separate_rows and indicate comma as separator:
df %>%
separate_rows(time, sep = ',')
Or in base R you could try this:
s <- strsplit(df$time, ',', fixed = TRUE)
data.frame(id = rep(df$id, lengths(s)), time = unlist(s))
Output
# A tibble: 10 x 2
id time
<int> <chr>
1 1 1
2 1 2
3 2 3
4 2 4
5 3 4
6 3 5
7 3 6
8 4 1
9 4 6
10 4 7
Data
df <- structure(list(id = 1:4, time = c("1,2", "3,4", "4,5,6", "1,6,7"
)), class = "data.frame", row.names = c(NA, -4L))
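As a rough illustration of the base R approach above, the split-and-repeat step can be wrapped in a small helper. The name split_rows and its arguments are made up for this sketch, and it assumes the column to split is character (or can be coerced to it):

```r
# Hypothetical helper: expand one delimited column into one row per value.
split_rows <- function(d, col, by, sep = ",") {
  # Split every cell on the separator; the result is a list of character vectors.
  parts <- strsplit(as.character(d[[col]]), sep, fixed = TRUE)
  # Repeat each key once per piece and flatten the pieces.
  out <- data.frame(rep(d[[by]], lengths(parts)), unlist(parts),
                    stringsAsFactors = FALSE)
  names(out) <- c(by, col)
  out
}

df <- data.frame(id = 1:3, time = c("1,2", "3,4", "4,5,6"),
                 stringsAsFactors = FALSE)
long <- split_rows(df, col = "time", by = "id")
long$time <- as.integer(long$time)  # convert once the values are separated
```

Because the number of pieces per row is taken from lengths(parts), the same call should cope with ids that have two, three, or more time values.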
Related
I want to delete the rows of any id that doesn't appear in EVERY month (another column):
The dataset is as follows:
id month
1 01
2 01
3 01
1 02
2 02
1 03
2 03
I want to delete id = 3 from the dataset, since it doesn't appear in month = 02
I'm using R.
Thank you for helping
You can split the dataset and use Reduce, i.e.
remove <- Reduce(setdiff, split(df$id, df$month))
df[!df$id %in% remove,]
id month
1 1 1
2 2 1
4 1 2
5 2 2
6 1 3
7 2 3
As @jay.sf mentioned, you need to assign it back to your dataframe,
df <- df[!df$id %in% remove,]
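One caveat: Reduce(setdiff, ...) starts from the ids of the first month, so an id that is missing from the first month but present later would not be caught. A sketch that instead keeps the ids common to all months with Reduce(intersect, ...) follows. The data mirror the question, plus a made-up id 4 that only appears in month 2:

```r
df <- data.frame(id    = c(1, 2, 3, 1, 2, 4, 1, 2),
                 month = c(1, 1, 1, 2, 2, 2, 3, 3))

# ids that occur in every month's group
keep <- Reduce(intersect, split(df$id, df$month))

# keep only the rows belonging to those ids
res <- df[df$id %in% keep, ]
```

Here both id 3 (missing from months 2 and 3) and id 4 (missing from months 1 and 3) are dropped.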
Using dplyr
library(dplyr)
df %>%
group_by(id) %>%
filter(n_distinct(month) == n_distinct(df$month)) %>%
ungroup()
-output
# A tibble: 6 × 2
id month
<int> <int>
1 1 1
2 2 1
3 1 2
4 2 2
5 1 3
6 2 3
Or using data.table
library(data.table)
data_hh[, if (uniqueN(month) == uniqueN(data_hh$month)) .SD, by = id]
data
data_hh <- structure(list(id = c(18354L, 18815L, 19014L, 63960L, 72996L,
73930L), month = c(1, 1, 1, 1, 1, 1), value = c(113.33, 251.19,
160.15, 278.8, 254.39, 733.22), x1 = c(96.75, 186.78, 106.02,
195.23, 184.57, 473.92), x2 = c(1799.1, 5399.1, 1799.1, 1349.1,
2924.1, 2024.1), x3 = c(85.37, 74.36, 66.2, 70.02, 72.55, 64.63
), x4 = c(6.29, 4.65, 8.9, 20.66, 8.69, 36.22)), row.names = c(NA,
-6L), class = c("data.table", "data.frame"))
I have a genetic dataset of IDs (dataset1) and a dataset of IDs which interact with each other (dataset2). I am trying to count IDs in dataset1 which appear in either of 2 interaction columns in dataset2 and also record which are the interacting/matching IDs in a 3rd column.
Dataset1:
ID
1
2
3
Dataset2:
Interactor1 Interactor2
1 5
2 3
1 10
Output:
ID InteractionCount Interactors
1 2 5, 10
2 1 3
3 1 2
So the output contains all IDs of dataset1 and a count of how often each ID appears in either column 1 or 2 of dataset2; if an ID does appear, the output also stores which IDs in dataset2 it interacts with.
I have a biology background, so I have been guessing at how to approach this. So far I've used merge() and setDT(mergeddata)[, .N, by=ID] to count the dataset1 IDs that appear in dataset2, but I'm not sure this approach can be extended to create the column storing the interacting IDs. Any help with functions that can store the matched IDs in a third column would be appreciated.
Input data:
dput(dataset1)
structure(list(ID = 1:3), row.names = c(NA, -3L), class = c("data.table",
"data.frame"))
dput(dataset2)
structure(list(Interactor1 = c(1L, 2L, 1L), Interactor2 = c(5L,
3L, 10L)), row.names = c(NA, -3L), class = c("data.table", "data.frame"
))
Here is an option using data.table:
x <- names(DT2)
cols <- c("InteractionCount", "Interactors")
#ensure that the pairs are ordered for each row and there are no duplicated pairs
DT2 <- setkeyv(unique(DT2[,(x) := .(pmin(i1, i2), pmax(i1, i2))]), x)
#for each ID find the neighbours linked to it
neighbours <- rbindlist(list(DT2[, .(.N, toString(i2)), i1],
DT2[, .(.N, toString(i1)), i2]), use.names=FALSE)
setnames(neighbours, names(neighbours), c("ID", cols))
#update DT1 using the above data
DT1[, (cols) := neighbours[DT1, on=.(ID), mget(cols)]]
output for DT1:
ID InteractionCount Interactors
1: 1 2 5, 10
2: 2 1 3
3: 3 1 2
data:
library(data.table)
DT1 <- structure(list(ID = 1:3), row.names = c(NA, -3L), class = c("data.table", "data.frame"))
DT2 <- structure(list(i1 = c(1L, 2L, 1L), i2 = c(5L, 3L, 10L)), row.names = c(NA, -3L), class = c("data.table", "data.frame"))
Another data.table answer.
library(data.table)
d1 <- data.table(ID=1:3)
d2 <- data.table(I1=c(1,2,1),I2=c(5,3,10))
# first stack I1 on I2 and vice versa
Output <- d2[,.(ID=c(I1,I2),x=c(I2,I1))]
Output
# ID x
# 1: 1 5
# 2: 2 3
# 3: 1 10
# 4: 5 1
# 5: 3 2
# 6: 10 1
# then collect the desired columns
Output <- Output[ID %in% d1$ID][
,.(InteractionCount=.N,
Interactors = list(x)),
by=ID]
Output
# ID InteractionCount Interactors
# 1: 1 2 5,10
# 2: 2 1 3
# 3: 3 1 2
EDIT:
If the IDs are not numeric, you can set a key on d1:
library(data.table)
d1 <- data.table(ID=c("1","2","3A"))
setkey(d1,ID)
d2 <- data.table(I1=c("1","2","1"),I2=c("5","3A","10"))
Output <- d2[,.(ID=c(I1,I2),x=c(I2,I1))]
Output
# ID x
# 1: 1 5
# 2: 2 3A
# 3: 1 10
# 4: 5 1
# 5: 3A 2
# 6: 10 1
Output <- Output[ID %in% unlist(d1[(ID)])][
,.(InteractionCount=.N,
Interactors = list(x)),
by=ID]
Output
# ID InteractionCount Interactors
# 1: 1 2 5,10
# 2: 2 1 3A
# 3: 3A 1 2
Here's a solution based on the tidyverse package.
library(tidyverse)
d1 <- tibble(ID=1:3)
d2 <- tibble(Interactor1=c(1, 2, 1), Interactor2=c(5, 3, 10))
I think some of your difficulty is caused by the fact that your data is not tidy. You can read about what this means on the tidyverse homepage. Let's make d2 tidy:
d2narrow <- d2 %>% gather(key="Where", value="ID", Interactor1, Interactor2)
d2narrow
which gives:
# A tibble: 6 x 2
Where ID
<chr> <dbl>
1 Interactor1 1
2 Interactor1 2
3 Interactor1 1
4 Interactor2 5
5 Interactor2 3
6 Interactor2 10
Now getting the InteractionCounts is easy:
counts <- d2narrow %>% group_by(ID) %>% summarise(InteractionCount=n())
counts
# A tibble: 5 x 2
ID InteractionCount
<dbl> <int>
1 1 2
2 2 1
3 3 1
4 5 1
5 10 1
We can get a list of Interactor2s for each value of Interactor1 by going back to the original d2...
interactors1 <- d2 %>%
group_by(Interactor1) %>%
summarise(With1=list(unique(Interactor2))) %>%
rename(ID=Interactor1)
interactors1
# A tibble: 2 x 2
ID With1
<dbl> <list>
1 1 <dbl [2]>
2 2 <dbl [1]>
If an ID can appear in both Interactor1 and Interactor2, things get a little more fiddly. (That doesn't happen in your example, but just in case...)
interactors2 <- d2 %>% group_by(Interactor2) %>% summarise(With2=list(unique(Interactor1))) %>% rename(ID=Interactor2)
interactors <- interactors1 %>%
full_join(interactors2, by="ID") %>%
unnest(cols=c(With1, With2)) %>%
mutate(With=ifelse(is.na(With1), With2, With1)) %>%
select(-With1, -With2)
interactors <- interactors %>%
group_by(ID) %>%
summarise(Interactors=list(unique(With)))
Now you can bring everything together, and make sure you get the data only for the IDs you want:
interactors <- d1 %>% left_join(counts, by="ID") %>% left_join(interactors, by="ID")
interactors
# A tibble: 3 x 3
ID InteractionCount Interactors
<dbl> <int> <list>
1 1 2 <dbl [2]>
2 2 1 <dbl [1]>
3 3 1 <dbl [1]>
That's the data in the format you requested (one column with a list of interactors for each ID). Just to prove it:
interactors$Interactors[1]
[[1]]
[1] 5 10
But I think you might find it easier to do more with the answer if it's in tidy form:
interactors %>% unnest(cols=c(Interactors))
# A tibble: 4 x 3
ID InteractionCount Interactors
<dbl> <int> <dbl>
1 1 2 5
2 1 2 10
3 2 1 3
4 3 1 2
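For comparison, the stacking idea used in the answers above can also be sketched in base R. This is only an illustration under the assumption that an interaction counts in both directions; the names pairs, partners, hits and result are made up here:

```r
dataset1 <- data.frame(ID = 1:3)
dataset2 <- data.frame(Interactor1 = c(1L, 2L, 1L),
                       Interactor2 = c(5L, 3L, 10L))

# Stack the pairs in both directions so every ID shows up in one column.
pairs <- unique(data.frame(
  ID      = c(dataset2$Interactor1, dataset2$Interactor2),
  Partner = c(dataset2$Interactor2, dataset2$Interactor1)
))

# Collect the partners of each ID, then look up only the IDs in dataset1.
partners <- split(pairs$Partner, pairs$ID)
hits <- partners[as.character(dataset1$ID)]

result <- data.frame(
  ID               = dataset1$ID,
  InteractionCount = lengths(hits, use.names = FALSE),
  Interactors      = vapply(hits, toString, character(1), USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
```

An ID from dataset1 with no interactions would get a count of 0 and an empty string, since the lookup returns NULL for it.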
I have a dataframe like so:
id val
a 10
a 50
b 30
Now for every id, I want to divide val by the number of repetitions of id and copy the row just as many times. So the final dataframe will become like so:
id val
a 5
a 5
a 25
a 25
b 30
Please note that the duplicate ids may not be consecutive.
How can I achieve this?
One dplyr option could be:
library(dplyr)
library(tidyr)
df %>%
group_by(id) %>%
mutate(n = n(), val = val/n) %>%
uncount(n)
id val
<chr> <dbl>
1 a 5
2 a 5
3 a 25
4 a 25
5 b 30
Store the counts in a vector and use it to repeat the data.frame:
df = data.frame(id=c("a","a","b"),val=c(10,50,30))
df$id = as.character(df$id)
n = table(df$id)
with(df,data.frame(id=rep(id,n[id]),val=rep(val/n[id],n[id])))
id val
1 a 5
2 a 5
3 a 25
4 a 25
5 b 30
Using tapply and stack.
stack(with(d, tapply(val, id, function(x) rep(x/length(x), each=length(x)))))
# values ind
# 1 5 a
# 2 5 a
# 3 25 a
# 4 25 a
# 5 30 b
Data:
d <- structure(list(id = c("a", "a", "b"), val = c(10L, 50L, 30L)), row.names = c(NA,
-3L), class = "data.frame")
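Since the question notes that duplicate ids may not be consecutive, here is a small base R sketch on reordered data (a, b, a) using ave() to get the per-row group size. The reordered data frame is made up for illustration, and the rows come out in their original order rather than grouped by id:

```r
d <- data.frame(id = c("a", "b", "a"), val = c(10, 30, 50),
                stringsAsFactors = FALSE)

# n[i] is the number of rows sharing the id of row i
n <- ave(seq_along(d$id), d$id, FUN = length)

# divide each value by its group size and repeat the row that many times
out <- data.frame(id = rep(d$id, n), val = rep(d$val / n, n),
                  stringsAsFactors = FALSE)
```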
How is it possible to concatenate (rbind) data frames that contain one or more data.frames among their columns? For example:
df <- data.frame(a=1:3)
df$df <- data.frame(a=1:3)
rbind( df, df)
Error in row.names<-.data.frame(*tmp*, value = value) :
duplicate 'row.names' are not allowed In addition: Warning message:
non-unique values when setting 'row.names': ‘1’, ‘2’, ‘3’
library(dplyr)
bind_rows(list(df,df))
Error: Argument 2 can't be a list containing data frames
The issue here seems to be not another data.frame within a data frame, but the non-unique rownames in the result. If you made sure that rownames are unique after rbind - it should work:
df1 <- data.frame(a=1:3)
df2 <- data.frame(a=1:3)
df1$df <- data.frame(a=1:3, row.names=letters[1:3])
df2$df <- data.frame(a=1:3, row.names=LETTERS[1:3])
> res <- rbind(df1, df2)
> res
a a
1 1 1
2 2 2
3 3 3
4 1 1
5 2 2
6 3 3
> res$df
a
a 1
b 2
c 3
A 1
B 2
C 3
The problem seems to be that rbind adjusts the rownames for the two data.frames being merged, but does not adjust the rownames for data.frames within data.frames.
One option would be to replicate df twice (or more) instead of rbind-ing it; this will automatically create non duplicated row.names. Try this:
df[rep(seq_len(nrow(df)), 2), ]
# output
a a
1 1 1
2 2 2
3 3 3
1.1 1 1
2.1 2 2
3.1 3 3
The same process using dplyr gives plain sequential row.names instead:
library(dplyr)
df %>% slice(rep(row_number(), 2))
# output
a a
1 1 1
2 2 2
3 3 3
4 1 1
5 2 2
6 3 3
We may list the data frames, then use mapply to handle the column types differently: stack for vectors and do.call(rbind) for data.frames.
L <- mget(ls(pattern="df\\.")) # or list(df.1, df.2, df.3)
res <- data.frame(a=stack(mapply(`[`, L, 1))[[1]])
res$df <- do.call(rbind, mapply(`[`, L, 2))
res
# a a
# 1 1 1
# 2 2 2
# 3 3 3
# 4 4 4
# 5 5 5
# 6 6 6
# 7 7 7
# 8 8 8
# 9 9 9
str(res)
# 'data.frame': 9 obs. of 2 variables:
# $ a : int 1 2 3 4 5 6 7 8 9
# $ df:'data.frame': 9 obs. of 1 variable:
# ..$ a: int 1 2 3 4 5 6 7 8 9
Data
df.1 <- structure(list(a = 1:3, df = structure(list(a = 1:3), class = "data.frame", row.names = c(NA,
-3L))), row.names = c(NA, -3L), class = "data.frame")
df.2 <- structure(list(a = 4:6, df = structure(list(a = 4:6), class = "data.frame", row.names = c(NA,
-3L))), row.names = c(NA, -3L), class = "data.frame")
df.3 <- structure(list(a = 7:9, df = structure(list(a = 7:9), class = "data.frame", row.names = c(NA,
-3L))), row.names = c(NA, -3L), class = "data.frame")
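For the two-frame case in the question, the same idea (bind the plain columns and the nested data frame separately, then reattach) can be sketched in a few lines. This assumes a single nested column named df, as in the question:

```r
df <- data.frame(a = 1:3)
df$df <- data.frame(a = 1:3)

# Bind the ordinary column(s) first ...
res <- rbind(df["a"], df["a"])
# ... then bind the nested data frames on their own and attach the result.
res$df <- rbind(df$df, df$df)
```

Because the inner data frames are bound by an ordinary rbind call, their row names are renumbered automatically and the duplicate row.names error should not arise.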
I would like to find a dplyr way to take average for the next 3 rows. Say I have a data frame:
data <- structure(list(x = 1:6, y = c(32.1056789265246, 3.48493686329687, 8.21300282100191, 6.72266588891445, 27.7353607044612, 18.5963631547696)), .Names = c("x", "y"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L))
# A tibble: 6 × 2
x y
<int> <dbl>
1 1 32.105679
2 2 3.484937
3 3 8.213003
4 4 6.722666
5 5 27.735361
6 6 18.596363
I want to generate a new data frame that has 3 rows with first one the average from row 2,3,4 and next from 3,4,5 and last one from 4,5,6.
A for loop is probably the easiest way but I would appreciate if there is some more elegant dplyr way to go...Thanks!
You can use the rollmean() function from the zoo package, with lapply to loop through the columns:
library(zoo)
as.data.frame(lapply(data, rollmean, 3))
# x y
#1 2 14.601206
#2 3 6.140202
#3 4 14.223676
#4 5 17.684797
If you don't need the first row:
as.data.frame(lapply(data[-1,], rollmean, 3))
# x y
#1 3 6.140202
#2 4 14.223676
#3 5 17.684797
You can use the RcppRoll package to do that as follows:
require(RcppRoll)
roll_mean(data$y[-1], 3) ## 6.140202 14.223676 17.684797
As I am not sure what output you are looking for, you could do:
require(dplyr)
data %>%
mutate(rmean = roll_meanl(y, 3)) %>%
filter(between(x, 2, 4)) %>%
select(-y)
Which results in:
# A tibble: 3 × 2
x rmean
<int> <dbl>
1 2 6.140202
2 3 14.223676
3 4 17.684797
Given that you asked specifically about dplyr, you could try this:
library(dplyr)
data %>%
mutate(av3 = (lead(y, n=1L) + lead(y, n=2L) + lead(y, n=3L))/3)
Which creates:
# A tibble: 6 × 3
x y av3
<int> <dbl> <dbl>
1 1 32.105679 6.140202
2 2 3.484937 14.223676
3 3 8.213003 17.684797
4 4 6.722666 NA
5 5 27.735361 NA
6 6 18.596363 NA
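If no extra package is wanted, the "mean of the next three rows" can also be sketched in base R with embed(), which lays out each window of consecutive values as a matrix row. The vector y below uses made-up round numbers so the result is easy to check by hand:

```r
y <- c(10, 1, 2, 3, 4, 5)

# Drop the first value, then build one row per window of 3 consecutive values.
# embed() stores each window in reverse order, which does not matter for a mean.
next3 <- rowMeans(embed(y[-1], 3))
```

This returns only the complete windows (rows 2-4, 3-5 and 4-6), so there are no trailing NA values as in the lead() version.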