Say I have a data frame like the following:
> mydf <- data.frame(a=c('A','B','C','D/E','F','G/H','I/J','K','L'), b=c(1,2,3,'4/5',6,'7/8','9/10',11,12))
> mydf
a b
1 A 1
2 B 2
3 C 3
4 D/E 4/5
5 F 6
6 G/H 7/8
7 I/J 9/10
8 K 11
9 L 12
How do I make it look like the following, with an easy one-liner (preferably base)? Thanks
> mydf2
a b
1 A 1
2 B 2
3 C 3
4 D 4
5 E 5
6 F 6
7 G 7
8 H 8
9 I 9
10 J 10
11 K 11
12 L 12
You can use separate_rows() from the tidyr package; convert = TRUE turns the split values in b back into numbers:
library(tidyr)
mydf <- data.frame(a=c('A','B','C','D/E','F','G/H','I/J','K','L'), b=c(1,2,3,'4/5',6,'7/8','9/10',11,12))
mydf
#> a b
#> 1 A 1
#> 2 B 2
#> 3 C 3
#> 4 D/E 4/5
#> 5 F 6
#> 6 G/H 7/8
#> 7 I/J 9/10
#> 8 K 11
#> 9 L 12
separate_rows(mydf, a, b, convert = TRUE)
#> a b
#> 1 A 1
#> 2 B 2
#> 3 C 3
#> 4 D 4
#> 5 E 5
#> 6 F 6
#> 7 G 7
#> 8 H 8
#> 9 I 9
#> 10 J 10
#> 11 K 11
#> 12 L 12
Created on 2018-04-18 by the reprex package (v0.2.0).
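Since the question asks for base R if possible, here is a minimal sketch of the same idea with strsplit(). It assumes both columns are split on "/" and that each row has the same number of pieces in a and b; the names a_parts and b_parts are just for illustration.

```r
mydf <- data.frame(a = c("A", "B", "C", "D/E", "F", "G/H", "I/J", "K", "L"),
                   b = c(1, 2, 3, "4/5", 6, "7/8", "9/10", 11, 12),
                   stringsAsFactors = FALSE)

# Split every cell on "/" (fixed = TRUE: literal match, not a regex)
a_parts <- strsplit(mydf$a, "/", fixed = TRUE)
b_parts <- strsplit(mydf$b, "/", fixed = TRUE)

# Sanity check: each row must split into the same number of pieces in both columns
stopifnot(identical(lengths(a_parts), lengths(b_parts)))

# Flatten the pieces into one row per piece; b becomes numeric again
mydf2 <- data.frame(a = unlist(a_parts),
                    b = as.numeric(unlist(b_parts)),
                    stringsAsFactors = FALSE)
mydf2
```

This is not quite a one-liner, but it needs no packages and fails loudly if the two columns do not line up.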
I have a data frame that gets updated frequently, and there are some rows that need to be removed from it if certain strings are found in them. I have done that previously using -grep to remove the rows containing the string in question, eg:
dataframe[-grep('some string', dataframe$column),]
However, at times that string doesn't appear in the dataframe, in which case the -grep is returning an empty dataframe. Here's a minimal reproducible example:
> test.df<-data.frame(number=c(1:10), letter=letters[1:10])
> test.df
number letter
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
6 6 f
7 7 g
8 8 h
9 9 i
10 10 j
> test.df[-grep('h', test.df$letter),]
number letter
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
6 6 f
7 7 g
9 9 i
10 10 j
> test.df[-grep('k', test.df$letter),]
[1] number letter
<0 rows> (or 0-length row.names)
I could wrap the 'test.df[-grep...' in an 'if' test to check if the search string is found prior to removing it, eg:
if(any(grepl('k',test.df$letter))){test.df<-test.df[-grep('k', test.df$letter),]}
...but it seems to me that this should be implicit in the -grep command. Is there a better (more efficient) way to accomplish row removal that doesn't threaten to remove all my data if the search string is absent from the data frame?
Using grepl you could do:
test.df <- data.frame(number = c(1:10), letter = letters[1:10])
test.df[!grepl("h", test.df$letter), ]
#> number letter
#> 1 1 a
#> 2 2 b
#> 3 3 c
#> 4 4 d
#> 5 5 e
#> 6 6 f
#> 7 7 g
#> 9 9 i
#> 10 10 j
test.df[!grepl("k", test.df$letter), ]
#> number letter
#> 1 1 a
#> 2 2 b
#> 3 3 c
#> 4 4 d
#> 5 5 e
#> 6 6 f
#> 7 7 g
#> 8 8 h
#> 9 9 i
#> 10 10 j
Created on 2023-01-19 with reprex v2.0.2
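As a short illustration of why the -grep version drops everything: when nothing matches, grep() returns an empty integer vector, and negating an empty vector is still empty, so the subset selects zero rows. A logical mask from grepl() always has one element per row, so it does not have this problem. The variable names idx and kept below are only for illustration.

```r
test.df <- data.frame(number = 1:10, letter = letters[1:10])

# No match: grep() returns integer(0), and -integer(0) is still integer(0)
idx <- grep("k", test.df$letter)
length(idx)
nrow(test.df[-idx, ])   # zero rows: an empty index selects nothing

# grepl() returns one TRUE/FALSE per row, so negating it keeps all rows here
kept <- test.df[!grepl("k", test.df$letter), ]
nrow(kept)
```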
Instead of negating the indices with - when subsetting, you can pass invert = TRUE to grep.
test.df[grep('k', test.df$letter, invert=TRUE),]
# number letter
#1 1 a
#2 2 b
#3 3 c
#4 4 d
#5 5 e
#6 6 f
#7 7 g
#8 8 h
#9 9 i
#10 10 j
test.df[grep('h', test.df$letter, invert=TRUE),]
# number letter
#1 1 a
#2 2 b
#3 3 c
#4 4 d
#5 5 e
#6 6 f
#7 7 g
#9 9 i
#10 10 j
In this case it looks like the whole string should be matched, so an alternative is to use == or !=.
test.df[test.df$letter != "k",]
test.df[test.df$letter != "h",]
I am analyzing a dataset and need to find matching samples between two versions of the data.
They (should) contain the same expression data, but they have different sample identifiers. Let's say the first dataframe looks like this:
gene sample expression
1 a a 1
2 a b 2
3 a c 3
4 a d 4
5 a e 5
6 a f 6
7 a g 7
8 a h 8
9 a i 9
10 a j 10
11 a k 11
12 a l 12
13 a m 13
14 a n 14
I made the dataframe for one gene, but you can imagine that this is a large dataset containing ~20k genes. What I need to do is find the closest match in gene expression so I know which samples correspond. The second dataframe might look like this:
gene sample expression
1 a z 1.5
2 a y 2.5
3 a x 3
4 a w 4.5
5 a v 5.7
6 a u 6.2
7 a t 7.8
8 a s 8.1
9 a r 9.8
10 a q 10.5
11 a p 11
12 a o 12
13 a 2 13.3
14 a 4 14.4
What I need to do is write a function (or something like that) that tries to match the expression of genes in a dataframe as closely as possible (for all genes) and reports the sample identifiers with the closest match. I'm quite new to R and could use a little help.
I would like the output to look like this:
gene sample expression sample2
1 a z 1 z
2 a y 2 y
3 a x 3 x
4 a w 4 w
5 a v 5 v
6 a u 6 u
7 a t 7 t
8 a s 8 s
9 a r 9 r
10 a q 10 q
11 a p 11 p
12 a o 12 o
13 a 2 13 2
14 a 4 14 4
That is, an extra column per sample that specifies the closest match in gene expression across all genes. The extra column must be based on all genes, not on a single gene.
Here are two options. In your example, it looks like there are always whole number matches, so you could join by whole number. Alternatively, you could try to extract the closest number. I use floor because it looks like you want 1.5 to be joined to 1 and not 2.
library(tidyverse)
#extract closest whole number
df1 |>
mutate(sample2 = map_chr(expression,
\(x)df2$sample[which.min(abs(x - floor(df2$expression)))]))
#> # A tibble: 14 x 4
#> gene sample expression sample2
#> <chr> <chr> <dbl> <chr>
#> 1 a a 1 z
#> 2 a b 2 y
#> 3 a c 3 x
#> 4 a d 4 w
#> 5 a e 5 v
#> 6 a f 6 u
#> 7 a g 7 t
#> 8 a h 8 s
#> 9 a i 9 r
#> 10 a j 10 q
#> 11 a k 11 p
#> 12 a l 12 o
#> 13 a m 13 2
#> 14 a n 14 4
#join by whole number
left_join(df1,
df2 |>
mutate(expression = as.numeric(gsub("^(.*)\\.\\d+$", "\\1", expression))) |>
select(sample2 = sample, expression),
by = "expression")
#> # A tibble: 14 x 4
#> gene sample expression sample2
#> <chr> <chr> <dbl> <chr>
#> 1 a a 1 z
#> 2 a b 2 y
#> 3 a c 3 x
#> 4 a d 4 w
#> 5 a e 5 v
#> 6 a f 6 u
#> 7 a g 7 t
#> 8 a h 8 s
#> 9 a i 9 r
#> 10 a j 10 q
#> 11 a k 11 p
#> 12 a l 12 o
#> 13 a m 13 2
#> 14 a n 14 4
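Both options above match on a single gene. For the "across all genes" part of the question, one possible sketch is to reshape each version into a gene-by-sample matrix and, for every sample in the first version, pick the sample in the second version with the smallest sum of squared differences over all genes. The toy data, the helper names m1, m2, d and matches, and the choice of squared distance are assumptions for illustration, not something given in the question.

```r
# Hypothetical toy data in the same long format as the question: 3 genes, 3 samples
df1 <- data.frame(gene = rep(c("g1", "g2", "g3"), each = 3),
                  sample = rep(c("a", "b", "c"), times = 3),
                  expression = c(1, 5, 9, 2, 6, 10, 3, 7, 11))
df2 <- data.frame(gene = rep(c("g1", "g2", "g3"), each = 3),
                  sample = rep(c("z", "y", "x"), times = 3),
                  expression = c(9.2, 1.1, 5.3, 10.1, 2.2, 6.1, 10.8, 3.3, 7.2))

# Reshape to gene x sample matrices (one value per gene/sample pair assumed)
m1 <- tapply(df1$expression, list(df1$gene, df1$sample), mean)
m2 <- tapply(df2$expression, list(df2$gene, df2$sample), mean)

# Keep only genes present in both versions, in the same order
genes <- intersect(rownames(m1), rownames(m2))
m1 <- m1[genes, , drop = FALSE]
m2 <- m2[genes, , drop = FALSE]

# d[i, j] = sum over genes of (sample i in m1 - sample j in m2)^2
d <- sapply(colnames(m2), function(s2) colSums((m1 - m2[, s2])^2))

# For each sample in version 1, the closest sample in version 2
matches <- data.frame(sample = rownames(d),
                      sample2 = colnames(d)[apply(d, 1, which.min)],
                      stringsAsFactors = FALSE)
matches
```

Note that this picks the nearest neighbour for each sample independently, so two samples could in principle be assigned the same sample2; if a strict one-to-one pairing is needed, that would have to be enforced separately.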
Here's a simplified example of my data:
I have a list of dataframes
set.seed(1)
data1 <- data.frame(A = sample(1:10))
data2 <- data.frame(A = sample(1:10))
data3 <- data.frame(A = sample(1:10))
data4 <- data.frame(A = sample(1:10))
list1 <- list(data1, data2, data3, data4)
And a dataframe containing the same number of values as there are dataframes in list1
data5 <- data.frame(B = c(10, 20, 30, 40))
I would like to create a new column C in each of the dataframes within list1 where:
C = A * (B/nrow(A))
with the value for B coming from data5, so that B = 10 for the first dataframe in list1 (i.e. data1), and B = 20 for the second dataframe data2 and so on.
From what I've read, mapply is probably the solution, but I'm struggling to work out how to specify a single value of B across all rows in each of the dataframes in list1.
Any suggestions would be hugely appreciated.
You can use Map to loop over several vectors or lists in parallel:
Map(function(df, B) transform(df, C = A*(B/nrow(df))),list1,data5$B)
#> [[1]]
#> A C
#> 1 8 8
#> 2 10 10
#> 3 1 1
#> 4 6 6
#> 5 7 7
#> 6 9 9
#> 7 3 3
#> 8 4 4
#> 9 2 2
#> 10 5 5
#>
#> [[2]]
#> A C
#> 1 10 20
#> 2 3 6
#> 3 2 4
#> 4 1 2
#> 5 9 18
#> 6 4 8
#> 7 6 12
#> 8 5 10
#> 9 8 16
#> 10 7 14
#>
#> [[3]]
#> A C
#> 1 5 15
#> 2 7 21
#> 3 1 3
#> 4 4 12
#> 5 2 6
#> 6 6 18
#> 7 10 30
#> 8 3 9
#> 9 8 24
#> 10 9 27
#>
#> [[4]]
#> A C
#> 1 3 12
#> 2 9 36
#> 3 6 24
#> 4 4 16
#> 5 2 8
#> 6 1 4
#> 7 10 40
#> 8 5 20
#> 9 7 28
#> 10 8 32
You can be a bit more compact using the tidyverse (map2() comes from purrr):
library(tidyverse)
map2(list1, data5$B, ~mutate(.x, C = A*(.y/nrow(.x))))
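Since the question mentions mapply specifically: Map is essentially mapply with SIMPLIFY = FALSE, so the same thing can be written with mapply directly. The sketch below rebuilds the example data with replicate() (an assumption for brevity, so the random values may differ from the question's) and stores the result in a hypothetical list2.

```r
set.seed(1)
# Four data frames, each with a column A holding a permutation of 1:10
list1 <- replicate(4, data.frame(A = sample(1:10)), simplify = FALSE)
data5 <- data.frame(B = c(10, 20, 30, 40))

# mapply passes the i-th data frame together with the i-th value of B;
# SIMPLIFY = FALSE keeps the result as a list of data frames
list2 <- mapply(function(df, B) {
  df$C <- df$A * (B / nrow(df))   # the single B value is recycled over all rows
  df
}, list1, data5$B, SIMPLIFY = FALSE)

head(list2[[2]])
```

Inside the function B is a single number, so R's recycling takes care of applying it to every row of the data frame.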
I'm trying to reshape a data frame in R:
Gene_ID Value Gene_ID.1 Value.1 Gene_ID.2 Value.2
1 A 0 A 3 A 1
2 B 5 B 6 B 5
3 C 7 C 2 C 7
4 D 8 D 9 D 2
5 E 5 E 8 E 4
6 F 6 F 4 F 5
I want to make it look like this:
Gene_ID Value
1 A 0
2 B 5
3 C 7
4 D 8
5 E 5
6 F 6
7 A 1
8 B 5
9 C 7
10 D 2
11 E 4
12 F 5
13 A 3
14 B 6
15 C 2
16 D 9
17 E 8
18 F 4
So simply stack the columns with the same names together. Is there a way to do so?
Thanks!
You can use either the combination of gather()/spread() or pivot_longer() from the tidyr package.
To learn more about the new pivot_xxx() functions, check out these links:
A Graphical Introduction to tidyr's pivot_*()
Pivoting data from columns to rows (and back!) in the tidyverse
library(dplyr)
library(tidyr)
txt <- " Gene_ID.0 Value.0 Gene_ID.1 Value.1 Gene_ID.2 Value.2
1 A 0 A 3 A 1
2 B 5 B 6 B 5
3 C 7 C 2 C 7
4 D 8 D 9 D 2
5 E 5 E 8 E 4
6 F 6 F 4 F 5"
dat <- read.table(text = txt, header = TRUE)
Combine gather(), separate() and spread() functions
dat %>%
mutate(Row_Nr = row_number()) %>%
gather(key, value, -Row_Nr) %>%
separate(key, into = c("key", "Gene_Nr"), sep = "\\.") %>%
spread(key, value) %>%
select(-Row_Nr)
#> Warning: attributes are not identical across measure variables;
#> they will be dropped
#> Gene_Nr Gene_ID Value
#> 1 0 A 0
#> 2 1 A 3
#> 3 2 A 1
#> 4 0 B 5
#> 5 1 B 6
#> 6 2 B 5
#> 7 0 C 7
#> 8 1 C 2
#> 9 2 C 7
#> 10 0 D 8
#> 11 1 D 9
#> 12 2 D 2
#> 13 0 E 5
#> 14 1 E 8
#> 15 2 E 4
#> 16 0 F 6
#> 17 1 F 4
#> 18 2 F 5
Use pivot_longer()
### gather all values columns
### separate original column names by the period "."
### into Gene_ID/Value and Gene_Nr
dat %>%
pivot_longer(everything(),
names_to = c(".value", "Gene_Nr"),
names_pattern = "(.*)\\.(.*)")
#> Gene_Nr Gene_ID Value
#> 1 0 A 0
#> 2 1 A 3
#> 3 2 A 1
#> 4 0 B 5
#> 5 1 B 6
#> 6 2 B 5
#> 7 0 C 7
#> 8 1 C 2
#> 9 2 C 7
#> 10 0 D 8
#> 11 1 D 9
#> 12 2 D 2
#> 13 0 E 5
#> 14 1 E 8
#> 15 2 E 4
#> 16 0 F 6
#> 17 1 F 4
#> 18 2 F 5
Created on 2019-12-08 by the reprex package (v0.3.0)
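If only the two stacked columns are needed, a base R sketch is to cut the data frame into its Gene_ID/Value pairs and rbind them. This assumes the columns always come in adjacent pairs in that order; blocks and stacked are illustrative names. Blocks are stacked in column order (0, 1, 2), which differs slightly from the row order shown in the question.

```r
txt <- " Gene_ID.0 Value.0 Gene_ID.1 Value.1 Gene_ID.2 Value.2
1 A 0 A 3 A 1
2 B 5 B 6 B 5
3 C 7 C 2 C 7
4 D 8 D 9 D 2
5 E 5 E 8 E 4
6 F 6 F 4 F 5"
dat <- read.table(text = txt, header = TRUE)

# Take columns 1-2, 3-4, 5-6 and give every pair the same names
blocks <- lapply(seq(1, ncol(dat), by = 2), function(i) {
  setNames(dat[, i:(i + 1)], c("Gene_ID", "Value"))
})

# Stack the pairs on top of each other and reset the row names
stacked <- do.call(rbind, blocks)
rownames(stacked) <- NULL
stacked
```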
This is an example of fct_reorder:
boxplot(Sepal.Width ~ fct_reorder(Species, Sepal.Width, .desc = TRUE), data = iris)
This code gives the same result as boxplot(Sepal.Width ~ reorder(Species, -Sepal.Width), data = iris)
What is the advantage of fct_reorder() over reorder()?
The two functions are very similar, but have a few differences:
reorder() works with atomic vectors and defaults to using mean().
fct_reorder() only works with factors (or character vectors) and defaults to using median()
Example:
library(forcats)
x <- 1:10
xf <- factor(1:10)
y <- 10:1
reorder(x, y)
#> [1] 1 2 3 4 5 6 7 8 9 10
#> attr(,"scores")
#> 1 2 3 4 5 6 7 8 9 10
#> 10 9 8 7 6 5 4 3 2 1
#> Levels: 10 9 8 7 6 5 4 3 2 1
reorder(xf, y)
#> [1] 1 2 3 4 5 6 7 8 9 10
#> attr(,"scores")
#> 1 2 3 4 5 6 7 8 9 10
#> 10 9 8 7 6 5 4 3 2 1
#> Levels: 10 9 8 7 6 5 4 3 2 1
fct_reorder(x, y)
#> Error: `f` must be a factor (or character vector).
fct_reorder(xf, y)
#> [1] 1 2 3 4 5 6 7 8 9 10
#> Levels: 10 9 8 7 6 5 4 3 2 1
Created on 2022-01-07 by the reprex package (v2.0.1)
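The difference in default summary functions matters as soon as a group is skewed. In the small made-up example below (f and v are illustrative), group "a" has a larger mean but a smaller median than group "b", so the two functions order the levels differently:

```r
library(forcats)

f <- factor(c("a", "a", "a", "b", "b", "b"))
v <- c(1, 1, 10, 2, 3, 4)
# group a: mean 4, median 1; group b: mean 3, median 3

levels(reorder(f, v))       # ordered by mean:   "b" "a"
levels(fct_reorder(f, v))   # ordered by median: "a" "b"
```

Either function can be told to use the other summary, e.g. reorder(f, v, FUN = median) or fct_reorder(f, v, .fun = mean).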