R: merging tables with different column names and retaining all columns

I have a big job to merge two large data.tables. This is new to me, and I need to demonstrate and explain it to colleagues, which is the reason for the paranoid approach: I'd like to randomly select some result rows to assure us all that the merge is doing what we think it is! Here is my MWE. Thanks, J.
library(data.table)
first <- data.table(index = c("a", "a", "b", "c", "c"),
                    type = 1:5,
                    value = 3:7)
second <- data.table(i2 = c("a", "a", "b", "c", "c"),
                     t2 = c(1:3, 7, 5),
                     value = 5:9)
second[first, on = c(i2 = "index", t2 = "type"), nomatch = 0L]
This is doing the job correctly, AFAIK, and gives this result:
i2 t2 value i.value
1: a 1 5 3
2: a 2 6 4
3: b 3 7 5
4: c 5 9 7
However, I would like, if possible, to retain all columns from both tables, such that the result would look like:
i2 t2 index type value i.value
1: a 1 a 1 5 3
2: a 2 a 2 6 4
3: b 3 b 3 7 5
4: c 5 c 5 9 7
Is it possible to retain all columns?

Yes, that's possible. Inside the join, first's columns (index, type) are available in j alongside second's; value is second's column and i.value is first's:
second[first, on = c(i2 = "index", t2 = "type"), nomatch = 0L,
       .(i2, t2, index, type, value, i.value)]
i2 t2 index type value i.value
1: a 1 a 1 5 3
2: a 2 a 2 6 4
3: b 3 b 3 7 5
4: c 5 c 5 9 7
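Inside the j expression of a data.table join, first's columns are available alongside second's, and prefixed names disambiguate the clashing value columns. As a sketch, assuming a data.table version recent enough to support the x. prefix in j, you can make each column's provenance explicit and randomly spot-check result rows, as the question asks:
res <- second[first, on = c(i2 = "index", t2 = "type"), nomatch = 0L,
              .(i2, t2, index, type, x.value = x.value, i.value = i.value)]
# randomly sample a few result rows to eyeball against the source tables
res[sample(.N, min(.N, 3L))]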

Related

Cumulative product in R across columns

I have a dataframe in the following format
> x <- data.frame("a" = c(1,1),"b" = c(2,2),"c" = c(3,4))
> x
a b c
1 1 2 3
2 1 2 4
I'd like to add 3 new columns containing a cumulative product of columns a, b, and c. However, I need a reverse cumulative product, i.e. the output should be
row 1:
result_d = 1*2*3 = 6, result_e = 2*3 = 6, result_f = 3
and similarly for row 2
The end result will be
a b c result_d result_e result_f
1 1 2 3 6 6 3
2 1 2 4 8 8 4
The column names do not matter; this is just an example. Does anyone have any idea how to do this?
As per my comment: is it possible to do this on a subset of columns, e.g. only for columns b and c, to return:
a b c results_e results_f
1 1 2 3 6 3
2 1 2 4 8 4
so that column "a" is effectively ignored?
One option is to loop through the rows with apply, run cumprod over the reversed elements, and then reverse the result back:
nm1 <- paste0("result_", c("d", "e", "f"))
x[nm1] <- t(apply(x, 1, function(x) rev(cumprod(rev(x)))))
x
# a b c result_d result_e result_f
#1 1 2 3 6 6 3
#2 1 2 4 8 8 4
Or a vectorized option is rowCumprods from matrixStats (this assumes x still has only the original columns a, b, c):
library(matrixStats)
x[nm1] <- rowCumprods(as.matrix(x[ncol(x):1]))[, ncol(x):1]
Another base R option is Reduce with accumulate = TRUE over the reversed columns:
temp <- data.frame(Reduce("*", x[NCOL(x):1], accumulate = TRUE))
setNames(cbind(x, temp[NCOL(temp):1]),
         c(names(x), c("res_d", "res_e", "res_f")))
# a b c res_d res_e res_f
#1 1 2 3 6 6 3
#2 1 2 4 8 8 4
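For the follow-up about a subset of columns, a minimal sketch (assuming x is the original three-column data frame) applies the same reverse-cumprod idea to just b and c, leaving a untouched:
cols <- c("b", "c")                    # columns to include
nm2 <- paste0("result_", c("e", "f"))  # names for the new columns
x[nm2] <- t(apply(x[cols], 1, function(r) rev(cumprod(rev(r)))))
x
#  a b c result_e result_f
#1 1 2 3        6        3
#2 1 2 4        8        4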

Replace values from dataframe where vector values match indexes in another dataframe

Perhaps this question has already been asked and answered, but I can't find it. Also, is there an efficient approach to this problem? This is just an example with a few rows, but I'll apply it to a data frame of about 1 million rows.
I'm kind of new to R.
I have two data frames
DF1:
a b
1 1 0
2 2 0
3 2 0
4 3 0
5 5 0
and
DF2
l
1 A
2 B
3 C
4 D
5 E
What I'm trying to do is match the values in DF1$a with the row indexes of DF2 and assign the corresponding values of DF2$l to DF1$b, so my result would look like this:
DF1:
a b
1 1 A
2 2 B
3 2 B
4 3 C
5 5 E
I've coded a for loop to do this, but it seems that I'm missing something:
for (i in 1:length(df1$a)) {
  df1$b[i] <- df2$l[df1$a[i]]
}
which gives the following result:
DF1:
a b
1 1 1
2 2 2
3 2 2
4 3 3
5 5 5
Thanks in advance :)
We can use merge to join the two data frames based on row id and a:
# Create example data frames
DF1 <- data.frame(a = c(1, 2, 2, 3, 5))
DF2 <- data.frame(l = c("A", "B", "C", "D", "E"),
                  stringsAsFactors = FALSE)
# Create a column a in DF2 holding the row id
DF2$a <- row.names(DF2)
# Merge DF1 and DF2 by a (note: merge sorts the result by the `by` column)
DF3 <- merge(DF1, DF2, by = "a", all.x = TRUE)
# Rename column l to b
names(DF3) <- c("a", "b")
DF3
# a b
# 1 1 A
# 2 2 B
# 3 2 B
# 4 3 C
# 5 5 E
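For what it's worth, the original loop logic was sound; the likely culprit for the integer output is that DF2$l had been read in as a factor, so the assignment copied its underlying integer codes. A vectorized base R alternative, sketched under the assumption that DF2$l is character (as created above), avoids both the loop and the merge:
# DF1$a already holds the row positions to look up in DF2
DF1$b <- DF2$l[DF1$a]
DF1
#   a b
# 1 1 A
# 2 2 B
# 3 2 B
# 4 3 C
# 5 5 E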

Keeping same name columns adjacent after merge

I have two data tables with lots of columns. The columns are the same but they are from different time points (one is from 2015 and one is from today). The structure of the data tables is roughly something like this:
library(data.table)
dt1 <- data.table(id = c("A", "B", "C"), i = c(2,4,6), a = c(1,2,3), w = c(2,3,4), f = c(2,3,5))
old_dt1 <- data.table(id = c("A", "B", "C"), i = c(1,2,6), a = c(1,1,1), w = c(2,1,2), f = c(1,3,1))
I would like to join them by id, but I want the columns with the same name to be placed next to each other.
My problem is that when I merge, I get the following (expected) result:
> merge(dt1, old_dt1, by = "id", suffixes = c("", "-2015"))
id i a w f i-2015 a-2015 w-2015 f-2015
1: A 2 1 2 2 1 1 2 1
2: B 4 2 3 3 2 1 1 3
3: C 6 3 4 5 6 1 2 1
I know I can manually reorder the data table with setcolorder, but I was wondering if I am missing something simpler (unfortunately, the columns are not in alphabetical order, so sorting the names is not an option...).
What I would like to get is the following:
result <- merge(dt1, old_dt1, by = "id", suffixes = c("", "-2015"))
setcolorder(result, c(1,2,6,3,7,4,8,5,9))
> result
id i i-2015 a a-2015 w w-2015 f f-2015
1: A 2 1 1 1 2 2 2 1
2: B 4 2 2 1 3 1 3 3
3: C 6 6 3 1 4 2 5 1
If the columns are in the same order in the two datasets, then create a matrix with 2 rows from the column names excluding the first ('id'), concatenate with 'id', and then set the column order:
setcolorder(result, c(names(result)[1],
                      matrix(names(result)[-1], nrow = 2, byrow = TRUE)))
result
# id i i-2015 a a-2015 w w-2015 f f-2015
#1: A 2 1 1 1 2 2 2 1
#2: B 4 2 2 1 3 1 3 3
#3: C 6 6 3 1 4 2 5 1
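An equivalent sketch that may read more transparently: build the interleaved name vector directly from dt1's column names (assuming, as above, that both tables share the same columns in the same order):
new_order <- c("id", paste0(rep(names(dt1)[-1], each = 2), c("", "-2015")))
new_order
# [1] "id" "i" "i-2015" "a" "a-2015" "w" "w-2015" "f" "f-2015"
setcolorder(result, new_order)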

How to remove individuals with fewer than 5 observations from a data frame [duplicate]

This question already has answers here:
Subset data frame based on number of rows per group
(4 answers)
Closed last month.
To clarify the question I'll briefly describe the data.
Each row in the data.frame is an observation, and the columns represent variables pertinent to that observation including: what individual was observed, when it was observed, where it was observed, etc. I want to exclude/filter individuals for which there are fewer than 5 observations.
In other words, if there are fewer than 5 rows where individual = x, then I want to remove all rows that contain individual x and reassign the result to a new data.frame. I'm aware of some brute-force techniques using something like names == unique(df$individualname), then subsetting out those names individually and applying nrow to determine whether or not to exclude them... but there has to be a better way. Any help is appreciated; I'm still pretty new to R.
An example using group_by and filter from the dplyr package:
library(dplyr)
df <- data.frame(id = c(rep("a", 2), rep("b", 5), rep("c", 8)),
                 foo = runif(15))
> df
id foo
1 a 0.8717067
2 a 0.9086262
3 b 0.9962453
4 b 0.8980123
5 b 0.1535324
6 b 0.2802848
7 b 0.9366375
8 c 0.8109557
9 c 0.6945285
10 c 0.1012925
11 c 0.6822955
12 c 0.3757085
13 c 0.7348635
14 c 0.3026395
15 c 0.9707223
df %>% group_by(id) %>% filter(n()>= 5) %>% ungroup()
Source: local data frame [13 x 2]
id foo
(fctr) (dbl)
1 b 0.9962453
2 b 0.8980123
3 b 0.1535324
4 b 0.2802848
5 b 0.9366375
6 c 0.8109557
7 c 0.6945285
8 c 0.1012925
9 c 0.6822955
10 c 0.3757085
11 c 0.7348635
12 c 0.3026395
13 c 0.9707223
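In more recent dplyr versions, add_count offers a shorthand for the same grouped count; a sketch (the helper column n is dropped afterwards):
df %>% add_count(id) %>% filter(n >= 5) %>% select(-n)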
or with base R:
> df[df$id %in% names(which(table(df$id)>=5)), ]
id foo
3 b 0.9962453
4 b 0.8980123
5 b 0.1535324
6 b 0.2802848
7 b 0.9366375
8 c 0.8109557
9 c 0.6945285
10 c 0.1012925
11 c 0.6822955
12 c 0.3757085
13 c 0.7348635
14 c 0.3026395
15 c 0.9707223
Still in base R, using with is a more elegant way to do the very same thing:
df[with(df, id %in% names(which(table(id)>=5))), ]
or:
subset(df, with(df, id %in% names(which(table(id)>=5))))
Another way to do the same thing, using the data.table package:
library(data.table)
set.seed(1)
dt <- data.table(id = sample(1:4, 20, replace = TRUE), var = sample(1:100, 20))
dt1 <- dt[, count := .N, by = id][count >= 5]
dt2 <- dt[, count := .N, by = id][count < 5]
dt1
id var count
1: 2 94 5
2: 2 22 5
3: 3 64 5
4: 4 13 6
5: 4 37 6
6: 4 2 6
7: 3 36 5
8: 3 81 5
9: 3 90 5
10: 2 17 5
11: 4 72 6
12: 2 57 5
13: 3 67 5
14: 4 9 6
15: 2 60 5
16: 4 34 6
dt2
id var count
1: 1 26 4
2: 1 31 4
3: 1 44 4
4: 1 54 4
It can also be done with data.table, using a logical condition with if after grouping by 'id':
library(data.table)
setDT(df)[, if(.N >=5) .SD, id]
# id foo
# 1: b 0.9962453
# 2: b 0.8980123
# 3: b 0.1535324
# 4: b 0.2802848
# 5: b 0.9366375
# 6: c 0.8109557
# 7: c 0.6945285
# 8: c 0.1012925
# 9: c 0.6822955
#10: c 0.3757085
#11: c 0.7348635
#12: c 0.3026395
#13: c 0.9707223
data
df <- structure(list(id = c("a", "a", "b", "b", "b", "b", "b", "c",
"c", "c", "c", "c", "c", "c", "c"), foo = c(0.8717067, 0.9086262,
0.9962453, 0.8980123, 0.1535324, 0.2802848, 0.9366375, 0.8109557,
0.6945285, 0.1012925, 0.6822955, 0.3757085, 0.7348635, 0.3026395,
0.9707223)), .Names = c("id", "foo"), class = "data.frame",
row.names = c(NA, -15L))
You can also use table. Take, for instance, the data frame mtcars:
table(mtcars$cyl)
You will see that cyl has 3 values: 4, 6, 8. There are 7 cars with 6 cylinders, and if you want to exclude groups with fewer than 10 observations, you can drop the cars with 6 cylinders like this:
mtcars[!mtcars$cyl %in% names(table(mtcars$cyl)[table(mtcars$cyl) < 10]), ]
This excludes the observations using only %in%, names, and table.
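One more base R idiom in the same spirit: a sketch using ave() to compute per-group row counts, aligned row by row with the data frame:
df[ave(seq_along(df$id), df$id, FUN = length) >= 5, ]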

Count changes to contents of a character vector [duplicate]

This question already has answers here:
Create group number for contiguous runs of equal values
(4 answers)
Closed 7 years ago.
I have a data_frame where a character variable x changes in time. I want to count the number of times it changes, and fill a new vector with this count.
library(dplyr)
df <- data_frame(
  x = c("a", "a", "b", "b", "c", "b"),
  wanted = c(1, 1, 2, 2, 3, 4)
)
x wanted
1 a 1
2 a 1
3 b 2
4 b 2
5 c 3
6 b 4
This is similar to, but different from, rle(df$x), which would return
Run Length Encoding
lengths: int [1:4] 2 2 1 1
values : chr [1:4] "a" "b" "c" "b"
I could try to rep() that output. I have also tried this, which is awfully close but wrong, for reasons I can't immediately figure out:
df %>% mutate(
  try_1 = cumsum(ifelse(x == lead(x) | is.na(lead(x)), 1, 0))
)
Source: local data frame [6 x 3]
x wanted try_1
1 a 1 1
2 a 1 1
3 b 2 2
4 b 2 2
5 c 3 2
6 b 4 3
It seems like there should be a function that does this directly; I just haven't found it yet.
Try this dplyr code:
df %>%
  mutate(try_1 = cumsum(ifelse(x != lag(x) | is.na(lag(x)), 1, 0)))
x wanted try_1
1 a 1 1
2 a 1 1
3 b 2 2
4 b 2 2
5 c 3 3
6 b 4 4
Yours was saying: increment the count if a value is the same as the following row's value, or if the following row's value is NA.
This says: increment the count if the variable on this row either is different than the one on the previous row, or if there wasn't one on the previous row (e.g., row 1).
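A slightly tighter variant of the same idea, sketched using lag()'s default argument to handle the first row instead of the is.na() branch:
df %>% mutate(try_1 = cumsum(x != lag(x, default = x[1])) + 1)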
You can try:
library(data.table) #data.table_1.9.5
setDT(df)[, wanted := rleid(x)][]
# x wanted
#1: a 1
#2: a 1
#3: b 2
#4: b 2
#5: c 3
#6: b 4
Or a base R option would be
inverse.rle(within.list(rle(as.character(df$x)),
                        values <- seq_along(values)))
#[1] 1 1 2 2 3 4
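Another base R route: a sketch that flags each position where x differs from its predecessor and numbers the runs with cumsum:
with(df, cumsum(c(TRUE, x[-1] != x[-length(x)])))
#[1] 1 1 2 2 3 4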
data
df <- data.frame(x=c("a", "a", "b", "b", "c", "b"))
