replace values in a data.frame with values from another data.frame - r

I have two dataframes with different dimensions,
df1 <- data.frame(names= sample(LETTERS[1:10]), duration=sample(0:100, 10))
>df1
names duration
1 J 97
2 G 57
3 H 53
4 A 23
5 E 100
6 D 90
7 C 73
8 F 60
9 B 37
10 I 67
df2 <- data.frame(names= LETTERS[1:5], names_new=letters[1:5])
> df2
names names_new
1 A a
2 B b
3 C c
4 D d
5 E e
I want to replace the values in df1$names that match df2$names with the corresponding values from df2$names_new. My desired output would be:
> df1
names duration
1 J 97
2 G 57
3 H 53
4 a 23
5 e 100
6 d 90
7 c 73
8 F 60
9 b 37
10 I 67
This is the code I'm using, but I wonder if there is a cleaner way to do it without so many steps,
df2[,1] <- as.character(df2[,1])
df2[,2] <- as.character(df2[,2])
df1[,1] <- as.character(df1[,1])
match(df1[,1], df2[,1]) -> id
which(!is.na(id)==TRUE) -> idx
id[!is.na(id)] -> id
df1[idx,1] <- df2[id,2]
Many thanks

Here's an approach from qdapTools:
library(qdapTools)
df1$names <- df1$names %lc+% df2
The %lc+% operator is a binary operator version of lookup. The left side holds the terms and the right side is the lookup table. The + means that any terms without a match revert to their original value, and the c converts the terms and keys to character before matching. This is a wrapper for the data.table package and is pretty speedy.
Here is the output including set.seed(1) for reproducibility:
set.seed(1)
df1 <- data.frame(names= sample(LETTERS[1:10]), duration=sample(0:100, 10),stringsAsFactors=F)
df2 <- data.frame(names= LETTERS[1:5], names_new=letters[1:5],stringsAsFactors=F)
library(qdapTools)
df1$names <- df1$names %lc+% df2
df1
## names duration
## 1 c 20
## 2 d 17
## 3 e 68
## 4 G 37
## 5 b 74
## 6 H 47
## 7 I 98
## 8 F 93
## 9 J 35
## 10 a 71

Are all names in df2 also in df1? And do you intend to keep them as a factor? If so, you might find this solution helpful.
idx <- match(levels(df2$names), levels(df1$names))
levels(df1$names)[idx] <- levels(df2$names_new)
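The factor-level idea can be sketched a little more defensively. The version below is only an illustration: it assumes df1$names is a factor and df2 holds plain character columns, and the helper vector `present` is not part of the original answer. It skips any name in df2 that has no level in df1, so it also runs when df2 is not a subset of df1.

```r
set.seed(1)
df1 <- data.frame(names = sample(LETTERS[1:10]), duration = sample(0:100, 10),
                  stringsAsFactors = TRUE)
df2 <- data.frame(names = LETTERS[1:5], names_new = letters[1:5],
                  stringsAsFactors = FALSE)

# Position of each old name among the levels of df1$names (NA if absent)
idx <- match(df2$names, levels(df1$names))
present <- !is.na(idx)

# Renaming a level relabels every row that uses it; row order is untouched
levels(df1$names)[idx[present]] <- df2$names_new[present]
levels(df1$names)
```

Because only the level labels change, the underlying integer codes of the factor stay the same.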

This works but requires that names and names_new are character and not factor.
set.seed(1)
df1 <- data.frame(names= sample(LETTERS[1:10]), duration=sample(0:100, 10),stringsAsFactors=F)
df2 <- data.frame(names= LETTERS[1:5], names_new=letters[1:5],stringsAsFactors=F)
rownames(df1) <- df1$names
df1[df2$names,]$names <- df2$names_new

Another option using merge:
transform(merge(df1,df2,all.x=TRUE),
names=ifelse(is.na(names_new),as.character(names),
as.character(names_new)))

Another way using match would be (if df1$names and df2$names_new are character vectors, of course)
df1[match(df2$names, df1$names), "names"] <- df2$names_new
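The one-liner above indexes df1 by the positions of df2$names, so it relies on every name in df2 being present in df1; a missing name would give an NA index, which data frame assignment is expected to reject. A minimal sketch of the reverse lookup, which keeps the old value wherever there is no match, could look like this. The data is copied from the question's printed df1, and character columns are assumed.

```r
df1 <- data.frame(names = c("J", "G", "H", "A", "E", "D", "C", "F", "B", "I"),
                  duration = c(97, 57, 53, 23, 100, 90, 73, 60, 37, 67),
                  stringsAsFactors = FALSE)
df2 <- data.frame(names = LETTERS[1:5], names_new = letters[1:5],
                  stringsAsFactors = FALSE)

# For each row of df1, the row of df2 holding its replacement (NA if none)
id <- match(df1$names, df2$names)

# Take the new name where a match exists, otherwise keep the old one
df1$names <- ifelse(is.na(id), df1$names, df2$names_new[id])
df1$names
# "J" "G" "H" "a" "e" "d" "c" "F" "b" "I"
```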


R convert list with multiple string lengths to data frame

I have a list:
l1<-list(A=1:10, B=100:120, C=300:310, D=400:430)
How do I convert it to dataframe with 2 columns:
C1 C2
R1 1 A
R2 2 A
...
R10 10 A
R11 100 B
R12 101 B
....
R72 429 D
R73 430 D
I tried:
df1 <- data.frame(matrix(unlist(l1), nrow=length(l1), byrow=T))
But I'm getting an error because the vectors in my list have different lengths. Also, my actual list consists of Dates and not just integers.
Just use stack:
stack(l1)
> head(stack(l1))
values ind
1 1 A
2 2 A
3 3 A
4 4 A
5 5 A
6 6 A
> tail(stack(l1))
values ind
68 425 D
69 426 D
70 427 D
71 428 D
72 429 D
73 430 D
Update
stack won't work with dates. If you have actual date objects, you can do:
data.frame(ind = rep(names(l1), lengths(l1)),
val = as.Date(unlist(l1), origin = "1970-01-01"))
or
data.frame(ind = rep(names(l1), lengths(l1)), val = do.call(c, l1))
Sample data:
l1<-list(A=Sys.Date()+(1:10),
B=Sys.Date()+(100:120),
C=Sys.Date()+(300:310),
D=Sys.Date()+(400:430))
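To see why the Date case needs special handling, here is a small sketch with fixed dates (chosen only for illustration, so the result does not depend on Sys.Date()). It shows that unlist() drops the Date class while do.call(c, ...) keeps it. The unname() call is an addition not in the answer above; it just keeps list names from leaking into the result.

```r
l1 <- list(A = as.Date("2020-01-01") + 0:1,
           B = as.Date("2020-03-01") + 0:2)

# unlist() strips the class and returns days since 1970-01-01
raw <- unlist(l1)
class(raw)   # "numeric"

# do.call(c, ...) dispatches to the Date method of c(), so the class survives
val <- do.call(c, unname(l1))

out <- data.frame(ind = rep(names(l1), lengths(l1)), val = val)
out
```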
Here's one method, similar to @Duck's answer, using Map and do.call:
tmp <- Map(data.frame,N = l1,L = names(l1))
out <- do.call(rbind,tmp)
rownames(out) <- NULL
> tail(out)
N L
68 425 D
69 426 D
70 427 D
71 428 D
72 429 D
73 430 D
Maybe a long solution, but using mapply() and do.call() you can reach the expected result. First, extract the names of the list as well as the number of elements in each vector. Then, using mapply(), create a list holding the first column of your desired result. After that, combine mapply(), do.call(), rbind() and cbind() to end up with df. Here is the code:
#Code
#names
v1 <- names(l1)
#length
v2 <- unlist(lapply(l1, length))
#Create values
l2 <- mapply(function(x,y) rep(x,y),v1,v2)
#Bind
df <- as.data.frame(do.call(rbind,mapply(cbind,l2,l1,SIMPLIFY=FALSE)))
#Convert via character so that factor codes are never returned
df$V2 <- as.numeric(as.character(df$V2))
Output (some rows):
head(df,15)
V1 V2
1 A 1
2 A 2
3 A 3
4 A 4
5 A 5
6 A 6
7 A 7
8 A 8
9 A 9
10 A 10
11 B 100
12 B 101
13 B 102
14 B 103
15 B 104

Multiply only columns with same names

I have a dataset as follows:
df1
Col1 Col2 A B C
A 1 2 3 4
B 2 5 7 8
df2
A B C D E
2 3 4 7 10
I want to multiply only the columns that are matching in both dataframes.
Final expected output:
Col1 Col2 A B C
A 1 4 9 16
B 2 10 21 32
My dataframe has many columns so if this could be dynamic in any way then it would be super helpful.
nm <- intersect(names(df1), names(df2))
df1[nm] <- sweep(df1[nm], 2, unlist(df2[nm]), `*`)
df1
# Col1 Col2 A B C
# 1 A 1 4 9 16
# 2 B 2 10 21 32
Using sweep is the main trick here.
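For readers unfamiliar with sweep, the following sketch rebuilds the question's data and spells out what the call does. With MARGIN = 2, the i-th element of the STATS vector is applied to the i-th column, so each shared column is multiplied by its own factor. The variable name `scaled` is only illustrative.

```r
df1 <- data.frame(Col1 = c("A", "B"), Col2 = c(1, 2),
                  A = c(2, 5), B = c(3, 7), C = c(4, 8))
df2 <- data.frame(A = 2, B = 3, C = 4, D = 7, E = 10)

nm <- intersect(names(df1), names(df2))   # "A" "B" "C"

# MARGIN = 2 pairs the i-th element of STATS with the i-th column of df1[nm]
scaled <- sweep(df1[nm], 2, unlist(df2[nm]), `*`)
scaled
#    A  B  C
# 1  4  9 16
# 2 10 21 32
```

Subsetting both data frames with the same `nm` vector is what guarantees that columns and multipliers line up in the same order.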
df1[] <- mapply(function(nm, dat) if (nm %in% names(df2) && is.numeric(dat)) dat*df2[[nm]] else dat,
names(df1), df1, SIMPLIFY=FALSE)
df1
# Col1 Col2 A B C
# 1 A 1 4 9 16
# 2 B 2 10 21 32
The df1[] <- ... is effectively (though not precisely) a shortcut of df1 <- as.data.frame(...).
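A tiny sketch of that point, using a made-up two-column data frame: assigning a list into `df[]` replaces the columns in place, so the object stays a data frame and keeps its column names.

```r
df <- data.frame(a = 1:2, b = c("x", "y"), stringsAsFactors = FALSE)

# lapply() returns a plain list, one element per column
doubled <- lapply(df, function(col) if (is.numeric(col)) col * 2 else col)
class(doubled)   # "list"

# Assigning into df[] fills the existing columns instead of replacing the object
df[] <- doubled
class(df)        # "data.frame"
```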
I was just about to suggest intersect when Julius' answer came up ... but I'll include it for completeness (since the rest of that answer is a little different anyway):
df1[intersect(names(df1), names(df2))] <-
mapply(function(nm, dat) dat*df2[[nm]],
intersect(names(df1), names(df2)), df1[intersect(names(df1), names(df2))], SIMPLIFY=FALSE)

Split dataframe into bins based on another vector

suppose I have the following dataframe
x <- c(12,30,45,100,150,305,2,46,10,221)
x2 <- letters[1:10]
df <- data.frame(x,x2)
df <- df[with(df, order(x)), ]
x x2
7 2 g
9 10 i
1 12 a
2 30 b
3 45 c
8 46 h
4 100 d
5 150 e
10 221 j
6 305 f
And I would like to split these into groups based on another vector,
v <- seq(0, 500, 50)
Basically, I would like to assign each row to a group based on how column x compares to v (for example, x <= an element of v); the index of that element in v is then used as the group for that row. The resulting table should look something like the following:
x x2 group
7 2 g g1
9 10 i g1
1 12 a g1
2 30 b g1
3 45 c g1
8 46 h g2
4 100 d g3
5 150 e g4
10 221 j g4
6 305 f g6
I could loop through each row and try to match it to v, but I'm still confused about how to easily detect where the match x <= element of v occurs so that I can assign a group id to it. Thanks.
You can use cut to break up df$x by the values of v:
df$group <- as.numeric(cut(df$x, breaks = v))
df$group <- paste0('g', df$group)
cut returns a factor so you can use as.numeric to just pull out which numeric bucket the value of df$x falls into based on v.
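A short sketch of how cut assigns the bins, using a few values picked for illustration. By default the intervals are closed on the right, so with these breaks 50 lands in the first bin and 100 in the second. A value equal to the lowest break falls outside every interval unless include.lowest = TRUE is set.

```r
x <- c(2, 46, 50, 100, 305)
v <- seq(0, 500, 50)

# Default intervals are right-closed: (0,50], (50,100], ...
bins <- cut(x, breaks = v)
group <- paste0("g", as.integer(bins))
group
# "g1" "g1" "g1" "g2" "g7"

# A value equal to the lowest break is NA by default
zero_default <- cut(0, breaks = v)
is.na(zero_default)          # TRUE

# include.lowest = TRUE closes the first interval on the left as well
zero_included <- cut(0, breaks = v, include.lowest = TRUE)
as.integer(zero_included)    # 1
```

Note that paste0 turns an NA bin into the string "gNA", so it may be worth checking for NA before building the labels.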

custom function after grouping data.frame

Given the following data.frame
d <- rep(c("a", "b"), each=5)
l <- rep(1:5, 2)
v <- 1:10
df <- data.frame(d=d, l=l, v=v*v)
df
d l v
1 a 1 1
2 a 2 4
3 a 3 9
4 a 4 16
5 a 5 25
6 b 1 36
7 b 2 49
8 b 3 64
9 b 4 81
10 b 5 100
Now I want to add another column after grouping by l. The extra column should contain the value of v_b - v_a
d l v e
1 a 1 1 35 (36-1)
2 a 2 4 45 (49-4)
3 a 3 9 55 (64-9)
4 a 4 16 65 (81-16)
5 a 5 25 75 (100-25)
6 b 1 36 35 (36-1)
7 b 2 49 45 (49-4)
8 b 3 64 55 (64-9)
9 b 4 81 65 (81-16)
10 b 5 100 75 (100-25)
The calculation for each value is shown in parentheses.
I'm looking for a way using dplyr. So I started with something like this
df %>%
group_by(l) %>%
mutate(e=myCustomFunction)
But how should I define myCustomFunction? I thought grouping of the data.frame produces another (sub-)data.frame which is a parameter to this function. But it isn't...
I guess this is the dplyr equivalent to @jlhoward's data.table solution:
df %>%
group_by(l) %>%
mutate(e = v[d == "b"] - v[d == "a"])
Edit after comment by OP:
If you want to use a custom function, here's a possible way:
myfunc <- function(x) {
with(x, v[d == "b"] - v[d == "a"])
}
df %>%
group_by(l) %>%
do(data.frame(. , e = myfunc(.))) %>%
arrange(d, l) # <- just to get it back in the original order
Edit after comment by @hadley:
As hadley commented below, it would be better in this case to define the function as
f <- function(v, d) v[d == "b"] - v[d == "a"]
and then use the custom function f inside a mutate:
df %>%
group_by(l) %>%
mutate(e = f(v, d))
Thanks @hadley for the comment.
Using dplyr:
library(dplyr)
df %>%
group_by(l) %>%
mutate(e=diff(v))
# d l v e
# 1 a 1 1 35
# 2 a 2 4 45
# 3 a 3 9 55
# 4 a 4 16 65
# 5 a 5 25 75
# 6 b 1 36 35
# 7 b 2 49 45
# 8 b 3 64 55
# 9 b 4 81 65
# 10 b 5 100 75
Here's an approach using data tables.
library(data.table)
DT <- as.data.table(df)
DT[,e := diff(v), by=l]
These approaches using diff(...) assume your data frame is sorted as in your example. If not, this is a more reliable way to do the same thing.
DT[, e := .SD[d == "b", v] - .SD[d == "a", v], by=l]
(or) even more directly
DT[, e := v[d == "b"] - v[d == "a"], by=l]
But if you want to access the entire subset of data and pass it to your custom function, then you can use .SD. Also make sure you read about .SDcols in ?data.table.
If you want to consider a non-dplyr option
df$e <- with(df, ave(v, l, FUN=function(x) diff(x)))
will do the trick. The ave function is useful for calculating values for groups of observations.
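Like the other diff-based versions, the ave call depends on "a" coming before "b" within each group. A base R sketch that looks the two values up by label instead, and so does not depend on row order, might look like this. The names `shuffled`, `diff_by_label` and `e_by_l` are made up for the example.

```r
df <- data.frame(d = rep(c("a", "b"), each = 5), l = rep(1:5, 2), v = (1:10)^2)

# Reorder so that "b" precedes "a" within every group
shuffled <- df[c(6, 1, 7, 2, 8, 3, 9, 4, 10, 5), ]

# diff() now subtracts in the wrong direction
wrong <- ave(shuffled$v, shuffled$l, FUN = diff)

# Pick the two values by label within each group instead of relying on order
diff_by_label <- function(g) g$v[g$d == "b"] - g$v[g$d == "a"]
e_by_l <- sapply(split(shuffled, shuffled$l), diff_by_label)

# Map the per-group result back onto the rows
shuffled$e <- unname(e_by_l[as.character(shuffled$l)])
shuffled
```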

Change the index number of a dataframe

After some manipulation of a data frame, I ended up with a result data frame whose index is not listed sequentially, as shown below.
MsgType/Cxr NoOfMsgs AvgElpsdTime(ms)
161 AM 86 30.13
171 CM 1 104
18 CO 27 1244.81
19 US 23 1369.61
20 VK 2 245
21 VS 11 1273.82
112 fqa 78 1752.22
24 SN 78 1752.22
I would like to get the result shown below.
MsgType/Cxr NoOfMsgs AvgElpsdTime(ms)
1 AM 86 30.13
2 CM 1 104
3 CO 27 1244.81
4 US 23 1369.61
5 VK 2 245
6 VS 11 1273.82
7 fqa 78 1752.22
8 SN 78 1752.22
How can I get this?
These are the rownames of your dataframe, which by default are 1:nrow(dfr). When you reordered the dataframe, the original rownames were reordered along with the rows. To number the rows sequentially in their new order, just use:
rownames(dfr) <- 1:nrow(dfr)
Or, simply
rownames(df) <- NULL
gives what you want.
> d <- data.frame(x = LETTERS[1:5], y = letters[1:5])[sample(5, 5), ]
> d
x y
5 E e
4 D d
3 C c
2 B b
1 A a
> rownames(d) <- NULL
> d
x y
1 E e
2 D d
3 C c
4 B b
5 A a
The index is actually the data frame row names. To change them, you can do something like:
rownames(dd) = 1:dim(dd)[1]
or
rownames(dd) = 1:nrow(dd)
Personally, I never use rownames.
In your example, I suspect that you don't need to worry about them either, since you are just renaming them 1 to n. In particular, when you subset your data frame the rownames will again be incorrect. For example,
##Simple data frame
R> dd = data.frame(a = rnorm(6))
R> dd$type = c("A", "B")
R> rownames(dd) = 1:nrow(dd)
R> dd
a type
1 2.1434 A
2 -1.1067 B
3 0.7451 A
4 -0.1711 B
5 1.4348 A
6 -1.3777 B
##Basic subsetting
R> dd_sub = dd[dd$type=="A",]
##Rownames are "wrong"
R> dd_sub
a type
1 2.1434 A
3 0.7451 A
5 1.4348 A
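If sequential row names are wanted after such a subset, they can simply be reset again. A minimal sketch, using integer data instead of rnorm so the result is reproducible:

```r
dd <- data.frame(a = 1:6, type = c("A", "B"))

# Subsetting keeps the row names of the selected rows
dd_sub <- dd[dd$type == "A", ]
before <- rownames(dd_sub)   # "1" "3" "5"

# Setting them to NULL restores the default 1..n numbering
rownames(dd_sub) <- NULL
after <- rownames(dd_sub)    # "1" "2" "3"
```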
