I am working in R and I have the following data frame, data:
country index value
A       0     15
B       1     15
C       2     15
D       3     15
E       4     15
F       5     15
How can I map values so that I get an extra column, EXTRA, holding specific information? For example, I want to specify (in any form) that countries with index 0, 1 and 2 should have the value first in EXTRA, that 3 and 5 should have second, and that 4 should have, say, eleventh. The expected output would look like this:
country index value EXTRA
A       0     15    first
B       1     15    first
C       2     15    first
D       3     15    second
E       4     15    eleventh
F       5     15    second
We can use a named vector to match and replace
nm1 <- setNames(c('first', 'first', 'first', 'second', 'eleventh', 'second'), 0:5)
df1$EXTRA <- nm1[as.character(df1$index)]
Or can use a join
library(data.table)
keydat <- data.frame(index = 0:5,
EXTRA = c('first', 'first', 'first', 'second', 'eleventh', 'second'))
setDT(df1)[keydat, EXTRA := EXTRA, on = .(index)]
data
df1 <- structure(list(country = c("A", "B", "C", "D", "E", "F"), index = 0:5,
value = c(15L, 15L, 15L, 15L, 15L, 15L)), class = "data.frame",
row.names = c(NA,
-6L))
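As a base R variant of the same lookup idea, here is a minimal sketch that uses match() against a small lookup table. The names lookup and dat, and the extra row with index 9, are illustrative assumptions and not part of the question. The extra row shows what happens to an index that has no entry in the lookup table.

```r
# Hypothetical lookup table: one row per index value.
lookup <- data.frame(
  index = 0:5,
  EXTRA = c("first", "first", "first", "second", "eleventh", "second"),
  stringsAsFactors = FALSE
)

# Sample data plus one index (9) that is absent from the lookup table.
dat <- data.frame(
  country = c("A", "B", "C", "D", "E", "F", "G"),
  index = c(0:5, 9L),
  value = 15L,
  stringsAsFactors = FALSE
)

# match() gives the row of `lookup` for each index, or NA when there is none,
# so unmatched rows end up with NA in EXTRA.
dat$EXTRA <- lookup$EXTRA[match(dat$index, lookup$index)]
print(dat)
```

Rows whose index is missing from the lookup table get NA, which can then be replaced with a default if one is wanted.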
Here is one option using nested ifelse
transform(
df,
EXTRA = ifelse(index %in% 0:2,
"first",
ifelse(index %in% c(3, 5),
"second",
"eleventh"
)
)
)
or merge + stack
merge(df,
setNames(
stack(list(first = 0:2, second = c(3, 5), eleventh = 4)),
c("index", "EXTRA")
),
by = "index",
all.x = TRUE
)
which gives
country index value EXTRA
1 A 0 15 first
2 B 1 15 first
3 C 2 15 first
4 D 3 15 second
5 E 4 15 eleventh
6 F 5 15 second
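To see what the stack step contributes, the sketch below builds only the lookup table from the grouped list. The names groups and lookup are illustrative assumptions. stack() turns a named list into one row per value, with the list name in a second column.

```r
# Each list name is the label; each element holds the index values it covers.
groups <- list(first = 0:2, second = c(3, 5), eleventh = 4)

# stack() returns columns `values` and `ind`; rename them to match the data.
lookup <- setNames(stack(groups), c("index", "EXTRA"))
print(lookup)

# The label column is a factor; convert it if plain strings are preferred.
lookup$EXTRA_chr <- as.character(lookup$EXTRA)
```

Note that the rows of the lookup table follow the order of the list, so index 5 appears before index 4; merge matches on the key, so this ordering does not affect the result.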
Say I have a list c of three data frames:
> c
$first
a b
1 1 2
2 2 3
3 3 4
$second
a b
1 2 4
2 4 6
3 6 8
$third
a b
1 3 6
2 6 9
3 9 12
I want to run an lapply on c that will do a custom function on each data frame.
The custom function depends on three numbers and I want the function to use a different number depending on which data frame it's evaluating.
I was thinking of utilizing the names 'first', 'second', and 'third', but I'm unsure how to get those names once they're inside the lapply function. It would look something like this:
lapply(c, function(list, num1 = 1, num2 = -1, num3 = 0) {num <- ifelse(names(list) == "first", num1, ifelse(names(list) == "second", num2, num3)); return(list*num)})
So the result I would want would be first multiplied by 1, second multiplied by -1, and third multiplied by 0.
The names function gives the values a and b (the column names) instead of the name of the data frame itself, so that doesn't work. Is there a function that would be able to give me the 'first', 'second', and 'third' values I need?
Or alternatively, is there a better way of doing this in a lapply function?
Maybe this would be easier with Map. We pass the multipliers in the order we want and do a simple multiplication
Map(`*`, lst1, c(1, -1, 0))
If the numbers are named
num1 <- setNames(c(1, -1, 0), c("first", "third", "second"))
then, match with the names of the list
Map(`*`, lst1, num1[names(lst1)])
#$first
# a b
#1 1 2
#2 2 3
#3 3 4
#$second
# a b
#1 0 0
#2 0 0
#3 0 0
#$third
# a b
#1 -3 -6
#2 -6 -9
#3 -9 -12
Or, if we decide to go with lapply, loop over the names of the list and use each name to extract both the list element and the corresponding element of the named vector
lapply(names(lst1), function(nm) lst1[[nm]] * num1[nm])
Or with sapply
sapply(names(lst1), function(nm) lst1[[nm]] * num1[nm], simplify = FALSE)
Or another option is map2 from purrr
library(purrr)
map2(lst1, num1[names(lst1)], `*`)
Note: c is the name of a base R function, so it is not recommended to reuse it as an object name.
data
lst1 <- list(first = structure(list(a = 1:3, b = 2:4), class = "data.frame",
row.names = c("1",
"2", "3")), second = structure(list(a = c(2L, 4L, 6L), b = c(4L,
6L, 8L)), class = "data.frame", row.names = c("1", "2", "3")),
third = structure(list(a = c(3L, 6L, 9L), b = c(6L, 9L, 12L
)), class = "data.frame", row.names = c("1", "2", "3")))
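One detail about looping over names is worth a short sketch: lapply over names(lst) returns an unnamed list, while sapply with simplify = FALSE keeps the names. The objects lst and mult below are illustrative assumptions, and mult uses the mapping the question asked for (first by 1, second by -1, third by 0).

```r
lst <- list(
  first  = data.frame(a = 1:3, b = 2:4),
  second = data.frame(a = c(2L, 4L, 6L), b = c(4L, 6L, 8L)),
  third  = data.frame(a = c(3L, 6L, 9L), b = c(6L, 9L, 12L))
)
# Hypothetical multipliers, keyed by list name.
mult <- c(first = 1, second = -1, third = 0)

# lapply over the names: correct values, but the result has no names.
res_lapply <- lapply(names(lst), function(nm) lst[[nm]] * mult[[nm]])
print(names(res_lapply))

# Restore the names explicitly ...
res_named <- setNames(res_lapply, names(lst))

# ... or let sapply attach them, since it names results after a character input.
res_sapply <- sapply(names(lst), function(nm) lst[[nm]] * mult[[nm]],
                     simplify = FALSE)
print(names(res_sapply))
```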
Besides the solutions by @akrun, you can also try the following code
mapply(`*`, lst1, c(1, -1, 0), SIMPLIFY = FALSE)
or
lapply(seq_along(lst1), function(k) lst1[[k]]*c(1,-1,0)[k])
Let's assume there are two huge data frames of different lengths, each with 2 columns, like:
df1 df2
A 1 C X
A 1 D X
B 4 C X
A 1 F X
B 4 A X
B 4 B X
C 7 B X
Each time there is a match between the 1st columns, X should be replaced with the value from column 2 of df1. If the 1st column of df2 contains elements that do not appear in the first column of df1 (F, D), X should be replaced with 0.
Since the data frames are huge, a nested loop would not be practical.
The solution should look like this:
df1 df2
A 1 C 7
A 1 D 0
B 4 C 7
A 1 F 0
B 4 A 1
B 4 B 4
C 7 B 4
Thank You in advance
As there are duplicate rows in 'df1', we can get the unique rows
df3 <- unique(df1)
Then, use match to get the index
i1 <- match(df2$Col1, df3$Col1)
and based on the index, assign
df2$Col2 <- df3$Col2[i1]
If there are no matches, it would be NA, which can be changed to 0
df2$Col2[is.na(df2$Col2)] <- 0
df2
# Col1 Col2
#1 C 7
#2 D 0
#3 C 7
#4 F 0
#5 A 1
#6 B 4
#7 B 4
Or this can be done with data.table by joining on 'Col1' and assigning 'Col2' from 'df3', after first removing the existing Col2 from 'df2'. Rows without a match are NA here as well, so the same replacement with 0 applies.
library(data.table)
setDT(df2)[, Col2 := NULL][df3, Col2 := Col2, on = .(Col1)]
data
df1 <- structure(list(Col1 = c("A", "A", "B", "A", "B", "B", "C"), Col2 = c(1,
1, 4, 1, 4, 4, 7)), class = "data.frame", row.names = c(NA, -7L
))
df2 <- structure(list(Col1 = c("C", "D", "C", "F", "A", "B", "B"), Col2 = c("X",
"X", "X", "X", "X", "X", "X")), class = "data.frame", row.names = c(NA,
-7L))
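The match-based steps can also be condensed into a short self-contained sketch. The names src, target, key and val are illustrative assumptions. Because match() returns the position of the first match, the deduplication step is optional as long as duplicated keys always carry the same value, which is assumed here.

```r
# Source table with repeated keys (each key always has the same value).
src <- data.frame(
  key = c("A", "A", "B", "A", "B", "B", "C"),
  val = c(1, 1, 4, 1, 4, 4, 7),
  stringsAsFactors = FALSE
)
# Target table whose values are to be filled in.
target <- data.frame(
  key = c("C", "D", "C", "F", "A", "B", "B"),
  stringsAsFactors = FALSE
)

# Position of the first matching key in `src`, or NA when the key is absent.
idx <- match(target$key, src$key)

# Use the matched value where there is one, and 0 otherwise.
target$val <- ifelse(is.na(idx), 0, src$val[idx])
print(target)
```

This is fully vectorised, so no explicit loop over the rows of either table is needed.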
I have this data:
dat=list(structure(list(Group.1 = structure(3:4, .Label = c("A","B", "C", "D", "E", "F"), class = "factor"), Pr1 = c(65, 75)), row.names = c(NA, -2L), class = "data.frame"),NULL, structure(list( Group.1 = structure(3:4, .Label = c("A","B", "C", "D", "E", "F"), class = "factor"), Pr1 = c(81,4)), row.names = c(NA,-2L), class = "data.frame"))
I want to combine these using bind_rows(dat) while keeping the list index as a variable.
The output should include a type column holding the list index ([[1]] and [[3]]):
type Group.1 Pr1
1 1 C 65
2 1 D 75
3 3 C 81
4 3 D 4
data.table solution
Use rbindlist() from the data.table package, which has built-in id support that respects NULL elements.
library(data.table)
rbindlist( dat, idcol = TRUE )
.id Group.1 Pr1
1: 1 C 65
2: 1 D 75
3: 3 C 81
4: 3 D 4
dplyr - partial solution
bind_rows also has ID-support, but it 'skips' empty elements...
bind_rows( dat, .id = "id" )
id Group.1 Pr1
1 1 C 65
2 1 D 75
3 2 C 81
4 2 D 4
Note that the ID of the third element from dat becomes 2, and not 3.
According to the documentation of bind_rows(), you can supply a column name through the .id argument. When bind_rows() is applied to a list of data.frames, the names of the list elements are used as values in that identifier column. [EDIT] But there is a problem, mentioned by @Wimpel: the list has no names.
names(dat)
NULL
However, supplying the names to the list will do the thing:
names(dat) <- 1:length(dat)
names(dat)
[1] "1" "2" "3"
bind_rows(dat, .id = "type")
type Group.1 Pr1
1 1 C 65
2 1 D 75
3 3 C 81
4 3 D 4
Or in one line, if you prefer:
bind_rows(setNames(dat, seq_along(dat)), .id = "type")
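For comparison, the same result can be sketched in base R without any packages. The helper names keep, pieces and out are illustrative assumptions, and the sample list below uses character columns instead of the factors in the question to keep it short. The idea is to record the original positions of the non-NULL elements before binding.

```r
dat <- list(
  data.frame(Group.1 = c("C", "D"), Pr1 = c(65, 75), stringsAsFactors = FALSE),
  NULL,
  data.frame(Group.1 = c("C", "D"), Pr1 = c(81, 4), stringsAsFactors = FALSE)
)

# Original positions of the elements that are not NULL.
keep <- which(!vapply(dat, is.null, logical(1)))

# Prepend the original position as a `type` column to each data frame.
pieces <- Map(function(d, i) cbind(type = i, d), dat[keep], keep)

# Stack the pieces and reset the row names.
out <- do.call(rbind, pieces)
rownames(out) <- NULL
print(out)
```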
I would like to use R to get all pairs of genes that share the same index. It may need a loop. For example, I want to turn two columns holding the gene name and the index:
a 1,
b 1,
c 1,
d 2,
e 2
into a new matrix
a b 1,
b c 1,
a c 1,
d e 2
Can anyone help?
A tidyverse option using combn on a grouped data.frame:
library(tidyverse)
df %>% group_by(index) %>%
summarise(gene = list(as_data_frame(t(combn(gene, 2))))) %>%
unnest(.sep = '_')
## # A tibble: 4 × 3
## index gene_V1 gene_V2
## <int> <chr> <chr>
## 1 1 a b
## 2 1 a c
## 3 1 b c
## 4 2 d e
The same logic can be replicated in base R:
df2 <- aggregate(gene ~ index, df, function(x){t(combn(x, 2))})
do.call(rbind, apply(df2, 1, data.frame))
## index gene.1 gene.2
## 1 1 a b
## 2 1 a c
## 3 1 b c
## 4 2 d e
Data
df <- structure(list(gene = c("a", "b", "c", "d", "e"), index = c(1L,
1L, 1L, 2L, 2L)), .Names = c("gene", "index"), row.names = c(NA,
-5L), class = "data.frame")
Here is an option using data.table. Convert the 'data.frame' to a 'data.table' (setDT(df)); then, grouped by 'index', get the combn of 'gene', transpose it, and set the names of the 2nd and 3rd columns (if needed).
library(data.table)
setnames(setDT(df)[, transpose(combn(gene, 2, FUN = list)),
by = index], 2:3, paste0("gene", 1:2))[]
# index gene1 gene2
#1: 1 a b
#2: 1 a c
#3: 1 b c
#4: 2 d e
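For readers who prefer no extra packages, the same group-then-combn idea can be sketched with split() and Map(). The names by_index and pairs are illustrative assumptions. Groups with fewer than two genes are dropped first, because combn(x, 2) needs at least two elements.

```r
df <- data.frame(
  gene = c("a", "b", "c", "d", "e"),
  index = c(1L, 1L, 1L, 2L, 2L),
  stringsAsFactors = FALSE
)

# One character vector of genes per index value.
by_index <- split(df$gene, df$index)

# combn() cannot form pairs from fewer than two genes, so skip such groups.
by_index <- by_index[lengths(by_index) >= 2]

# Build the pairs for each group; the index comes from the list names,
# so it is stored as character here.
pairs <- do.call(rbind, Map(function(g, idx) {
  m <- t(combn(g, 2))
  data.frame(index = idx, gene1 = m[, 1], gene2 = m[, 2],
             stringsAsFactors = FALSE)
}, by_index, names(by_index)))
rownames(pairs) <- NULL
print(pairs)
```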
I have multiple files with many rows and three columns, and I need to merge them by matching on the first two columns.
File 1
12 13 a
13 15 b
14 17 c
4 9 d
. . .
. . .
81 23 h
File 2
12 13 e
3 10 b
14 17 c
4 9 j
. . .
. . .
1 2 k
File 3
12 13 m
13 15 k
1 7 x
24 9 d
. . .
. . .
1 2 h
and so on.
I want to merge them to obtain the following result
12 13 a e m
13 15 b k
14 17 c c
4 9 d j
3 10 b
24 9 d
. . .
. . .
81 23 h
1 2 k
1 7 x
The first thing that usually comes to mind with these types of problems is merge, perhaps in conjunction with a Reduce(function(x, y) merge(x, y, by = "somecols", all = TRUE), yourListOfDataFrames).
However, merge is not always the most efficient function, especially since it looks like you want to "collapse" all the values to fill in the rows from left to right, which would not be the default merge behavior.
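To make that point concrete, here is a minimal sketch of the Reduce/merge route on three small data frames. The renaming step and the names files, renamed and merged are illustrative assumptions, added so that the value columns do not collide during the repeated merges.

```r
file1 <- data.frame(V1 = c(12L, 13L, 14L, 4L), V2 = c(13L, 15L, 17L, 9L),
                    V3 = c("a", "b", "c", "d"), stringsAsFactors = FALSE)
file2 <- data.frame(V1 = c(12L, 3L, 14L, 4L), V2 = c(13L, 10L, 17L, 9L),
                    V3 = c("e", "b", "c", "j"), stringsAsFactors = FALSE)
file3 <- data.frame(V1 = c(12L, 13L, 1L, 24L), V2 = c(13L, 15L, 7L, 9L),
                    V3 = c("m", "k", "x", "d"), stringsAsFactors = FALSE)

files <- list(file1 = file1, file2 = file2, file3 = file3)

# Give each value column a distinct name (the name of its source).
renamed <- Map(function(d, nm) setNames(d, c("V1", "V2", nm)),
               files, names(files))

# Full outer join of all data frames on the two key columns.
merged <- Reduce(function(x, y) merge(x, y, by = c("V1", "V2"), all = TRUE),
                 renamed)
print(merged)
```

Each value stays in the column of the file it came from, so the row for 13/15 holds b, NA, k instead of b, k followed by a blank. Values are not shifted to the left.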
Instead, I suggest you stack everything into one long data.frame and reshape it after you have added an index variable.
Here are two approaches:
Option 1: "dplyr" + "tidyr"
Use mget to put all of your data.frames into a list.
Use bind_rows (the replacement for rbind_all, which has since been removed from "dplyr") to convert that list into a single data.frame.
Use sequence(n()) in mutate from "dplyr" to group the data and create an index.
Use spread from "tidyr" to transform from a "long" format to a "wide" format.
library(dplyr)
library(tidyr)
combined <- bind_rows(mget(ls(pattern = "^file\\d")))
combined %>%
group_by(V1, V2) %>%
mutate(time = sequence(n())) %>%
ungroup() %>%
spread(time, V3, fill = "")
# Source: local data frame [7 x 5]
#
# V1 V2 1 2 3
# 1 1 7 x
# 2 3 10 b
# 3 4 9 d j
# 4 12 13 a e m
# 5 13 15 b k
# 6 14 17 c c
# 7 24 9 d
Option 2: "data.table"
Use mget to put all of your data.frames into a list.
Use rbindlist to convert that list into a single data.table.
Use sequence(.N) to generate your sequence by your groups.
Use dcast.data.table to convert the "long" data.table into a "wide" one.
library(data.table)
dcast.data.table(
rbindlist(mget(ls(pattern = "^file\\d")))[,
time := sequence(.N), by = list(V1, V2)],
V1 + V2 ~ time, value.var = "V3", fill = "")
# V1 V2 1 2 3
# 1: 1 7 x
# 2: 3 10 b
# 3: 4 9 d j
# 4: 12 13 a e m
# 5: 13 15 b k
# 6: 14 17 c c
# 7: 24 9 d
Both of these answers assume we are starting with the following sample data:
file1 <- structure(
list(V1 = c(12L, 13L, 14L, 4L), V2 = c(13L, 15L, 17L, 9L),
V3 = c("a", "b", "c", "d")), .Names = c("V1", "V2", "V3"),
class = "data.frame", row.names = c(NA, -4L))
file2 <- structure(
list(V1 = c(12L, 3L, 14L, 4L), V2 = c(13L, 10L, 17L, 9L),
V3 = c("e", "b", "c", "j")), .Names = c("V1", "V2", "V3"),
class = "data.frame", row.names = c(NA, -4L))
file3 <- structure(
list(V1 = c(12L, 13L, 1L, 24L), V2 = c(13L, 15L, 7L, 9L),
V3 = c("m", "k", "x", "d")), .Names = c("V1", "V2", "V3"),
class = "data.frame", row.names = c(NA, -4L))
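The same stack, index, then reshape sequence can also be sketched in base R alone, using ave() for the within-group counter and reshape() for the long-to-wide step. This is a sketch under the assumption that the sample data looks like the three data frames above, which are rebuilt here in a shorter form; the names combined and wide are illustrative.

```r
file1 <- data.frame(V1 = c(12L, 13L, 14L, 4L), V2 = c(13L, 15L, 17L, 9L),
                    V3 = c("a", "b", "c", "d"), stringsAsFactors = FALSE)
file2 <- data.frame(V1 = c(12L, 3L, 14L, 4L), V2 = c(13L, 10L, 17L, 9L),
                    V3 = c("e", "b", "c", "j"), stringsAsFactors = FALSE)
file3 <- data.frame(V1 = c(12L, 13L, 1L, 24L), V2 = c(13L, 15L, 7L, 9L),
                    V3 = c("m", "k", "x", "d"), stringsAsFactors = FALSE)

# Stack everything into one long data frame.
combined <- rbind(file1, file2, file3)

# Running counter within each (V1, V2) group: 1 for the first value, 2 for the next, ...
combined$time <- ave(seq_len(nrow(combined)), combined$V1, combined$V2,
                     FUN = seq_along)

# One row per (V1, V2), with the values spread over V3.1, V3.2, V3.3.
wide <- reshape(combined, idvar = c("V1", "V2"), timevar = "time",
                direction = "wide")
print(wide)
```

Missing cells come out as NA rather than empty strings, and rows appear in order of first occurrence rather than sorted by key.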