Related
i have list of dataframes and the dataframes have some duplicated columns. I want to merge duplicated columns which row is greater than others(some data frames have much more duplicates).
example data:
temp <- data.frame(seq_len(15), 5, 3)
colnames(temp) <- c("A", "A", "B")
temp$A[5]=NA
temp$A[3]=NA
temp$A[2]=NA
temp[7,2]=NA
A A B
<int> <dbl> <dbl>
1 5 3
NA 5 3
NA 5 3
4 5 3
NA 5 3
6 5 3
7 NA 3
8 5 3
9 5 3
10 5 3
final output
A B
<int> <dbl>
1 3
5 3
5 3
5 3
5 3
6 3
7 3
8 3
9 3
10 3
Thanks for everyone
A base R approach would be to split the data frame based on similarity of columns and select row-wise maximum using do.call + pmax.
data.frame(sapply(split.default(temp, names(temp)), function(x)
do.call(pmax, c(x, na.rm = TRUE))))
# A B
#1 5 3
#2 5 3
#3 5 3
#4 5 3
#5 5 3
#6 6 3
#7 7 3
#8 8 3
#9 9 3
#10 10 3
#11 11 3
#12 12 3
#13 13 3
#14 14 3
#15 15 3
Suppose you have the following list in R:
list_test <- list(c(2,4,5, 6), c(1,2,3), c(7,8))
What I am looking for is a dataframe of the following form:
value list_index
2 1
4 1
5 1
6 1
1 2
2 2
3 2
7 3
8 3
I tried to find a solution with the tidyverse but either lost the the list_index/name or had problems with the unequal length of the vectors.
You can give name to the list and then use stack in base R.
names(list_test) <- seq_along(list_test)
stack(list_test)
# values ind
#1 2 1
#2 4 1
#3 5 1
#4 6 1
#5 1 2
#6 2 2
#7 3 2
#8 7 3
#9 8 3
If interested in a tidyverse solution we can use enframe with unnest.
tibble::enframe(list_test) %>% tidyr::unnest(value)
Or imap_dfr from purrr.
purrr::imap_dfr(list_test, ~tibble::tibble(value = .x, list_index = .y))
Another option could be:
map_dfr(list_test, ~ enframe(.) %>%
select(-name), .id = "name")
name value
<chr> <dbl>
1 1 2
2 1 4
3 1 5
4 1 6
5 2 1
6 2 2
7 2 3
8 3 7
9 3 8
Or if you don't mind to have a column also with vector indexes:
map_dfr(list_test, enframe, .id = "name_list")
name_list name value
<chr> <int> <dbl>
1 1 1 2
2 1 2 4
3 1 3 5
4 1 4 6
5 2 1 1
6 2 2 2
7 2 3 3
8 3 1 7
9 3 2 8
In base R, we can use lengths to replicate the sequence and unlist the list elements into a two column 'data.frame'
data.frame(value = unlist(list_test),
list_index = rep(seq_along(list_test), lengths(list_test)))
# value list_index
#1 2 1
#2 4 1
#3 5 1
#4 6 1
#5 1 2
#6 2 2
#7 3 2
#8 7 3
#9 8 3
I'm lost, so any directions would be helpful. Let's say I have a dataframe:
df <- data.frame(
id = 1:12,
v1 = rep(c(1:4), 3),
v2 = rep(c(1:3), 4),
v3 = rep(c(1:6), 2),
v4 = rep(c(1:2), 6))
My goal would be to recode 2=4 and 4=2 for variables v3 and v4 but only for the first 4 cases (id < 5). I'm looking for a solution that works for up to twenty variables. I know how to do basic recoding but I don't see a simple way to implement the subset condition while manipulating multiple variables.
Here is a base R solution,
df[1:5, c('v3', 'v4')] <- lapply(df[1:5, c('v3', 'v4')], function(i)
ifelse(i == 2, 4, ifelse(i == 4, 2, i)))
which gives,
id v1 v2 v3 v4
1 1 1 1 1 1
2 2 2 2 4 4
3 3 3 3 3 1
4 4 4 1 2 4
5 5 1 2 5 1
6 6 2 3 6 2
7 7 3 1 1 1
8 8 4 2 2 2
9 9 1 3 3 1
10 10 2 1 4 2
11 11 3 2 5 1
12 12 4 3 6 2
You can try mutate_at with case_when in dplyr
library(dplyr)
df %>%
mutate_at(vars(v3:v4), ~case_when(id < 5 & . == 4 ~ 2L,
id < 5 & . == 2 ~ 4L,
TRUE ~.))
# id v1 v2 v3 v4
#1 1 1 1 1 1
#2 2 2 2 4 4
#3 3 3 3 3 1
#4 4 4 1 2 4
#5 5 1 2 5 1
#6 6 2 3 6 2
#7 7 3 1 1 1
#8 8 4 2 2 2
#9 9 1 3 3 1
#10 10 2 1 4 2
#11 11 3 2 5 1
#12 12 4 3 6 2
With mutate_at you can specify range of columns to apply the function.
Another, more direct, option is to get the indices of the numbers to replace, and to replace them by 6 minus the number (6-4=2, 6-2=4):
whToChange <- which(df[1:5, c("v3", "v4")] ==2 | df[1:5, c("v3", "v4")]==4, arr.ind=TRUE)
df[, c("v3", "v4")][whToChange] <- 6-df[, c("v3", "v4")][whToChange]
head(df, 5)
# id v1 v2 v3 v4
#1 1 1 1 1 1
#2 2 2 2 4 4
#3 3 3 3 3 1
#4 4 4 1 2 4
#5 5 1 2 5 1
You can use match and a lookup table - just in chase you have to recede more than two values.
rosetta <- matrix(c(2,4,4,2), 2)
df[1:4, c("v3", "v4")] <- lapply(df[1:4, c("v3", "v4")], function(x) {
i <- match(x, rosetta[1,]); j <- !is.na(i); "[<-"(x, j, rosetta[2, i[j]])})
df
# id v1 v2 v3 v4
#1 1 1 1 1 1
#2 2 2 2 4 4
#3 3 3 3 3 1
#4 4 4 1 2 4
#5 5 1 2 5 1
#6 6 2 3 6 2
#7 7 3 1 1 1
#8 8 4 2 2 2
#9 9 1 3 3 1
#10 10 2 1 4 2
#11 11 3 2 5 1
#12 12 4 3 6 2
Have also a look at R: How to recode multiple variables at once or Recoding multiple variables in R
I would like to use the following data frame
time <- c("01/01/1951", "02/01/1951", "03/01/1951", "04/01/1951", "03/03/1953", "04/03/1953", "05/03/1953", "06/03/1953", "02/01/1951", "03/01/1951", "04/01/1951", "05/01/1951", "13/03/1953", "14/03/1953", "15/03/1953", "16/03/1953", "01/05/1951", "02/05/1951", "03/05/1951", "04/05/1951", "04/03/1953", "05/03/1953", "06/03/1953", "07/03/1953")
member <- c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3)
trainall <- data.frame(time, member)
trainall$time = as.Date(trainall$time,format="%d/%m/%Y")
to order it by group of consecutive days based on the members. therefore if the same days are in member 2 and 1 I dont want them grouped together as consecutive!
ultimately I want a new column making this group
this is what I tried but it didnt work
y = sort(trainall$time)
trainall$g = cumsum(c(1, abs(y[-length(y)] - y[-1]) > 1))
this is the outcome I want.
trainall
time member g
1 01/01/1951 1 1
2 02/01/1951 1 1
3 03/01/1951 1 1
4 04/01/1951 1 1
5 03/03/1953 1 2
6 04/03/1953 1 2
7 05/03/1953 1 2
8 06/03/1953 1 2
9 02/01/1951 2 3
10 03/01/1951 2 3
11 04/01/1951 2 3
12 05/01/1951 2 3
13 13/03/1953 2 4
14 14/03/1953 2 4
15 15/03/1953 2 4
16 16/03/1953 2 4
17 01/05/1951 3 5
18 02/05/1951 3 5
19 03/05/1951 3 5
20 04/05/1951 3 5
21 04/03/1953 3 6
22 05/03/1953 3 6
23 06/03/1953 3 6
24 07/03/1953 3 6
ultimately this is the outcome I want. however, here I did it manually and my actual data frame is much much larger (16 members)
anyone know how to easily do this?
The use of logical values as integers 0 and 1 and your friend diff can do the trick. Something like this should do it, provided that your data is sorted by member and time.
# Your data
time <- c("01/01/1951", "02/01/1951", "03/01/1951", "04/01/1951", "03/03/1953", "04/03/1953", "05/03/1953", "06/03/1953", "02/01/1951", "03/01/1951", "04/01/1951", "05/01/1951", "13/03/1953", "14/03/1953", "15/03/1953", "16/03/1953", "01/05/1951", "02/05/1951", "03/05/1951", "04/05/1951", "04/03/1953", "05/03/1953", "06/03/1953", "07/03/1953")
member <- c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3)
trainall <- data.frame(time, member)
trainall$time = as.Date(trainall$time,format="%d/%m/%Y")
# Creating column g
trainall$g <- cumsum(c(1, (abs(diff(trainall$time)) + diff(trainall$member))!=1))
print(trainall)
# time member g
#1 1951-01-01 1 1
#2 1951-01-02 1 1
#3 1951-01-03 1 1
#4 1951-01-04 1 1
#5 1953-03-03 1 2
#6 1953-03-04 1 2
#7 1953-03-05 1 2
#8 1953-03-06 1 2
#9 1951-01-02 2 3
#10 1951-01-03 2 3
#11 1951-01-04 2 3
#12 1951-01-05 2 3
#13 1953-03-13 2 4
#14 1953-03-14 2 4
#15 1953-03-15 2 4
#16 1953-03-16 2 4
#17 1951-05-01 3 5
#18 1951-05-02 3 5
#19 1951-05-03 3 5
#20 1951-05-04 3 5
#21 1953-03-04 3 6
#22 1953-03-05 3 6
#23 1953-03-06 3 6
#24 1953-03-07 3 6
Edit: Added abs() around the time difference. I guess the abs cannot strictly be omitted as you could have a time difference of -2 days when the member changes, which cause the sum to be 1.
Edit 2: Re. your extra comment, try
trainall$G <- sequence(table(trainall$g))
Here is one option with .GRP from data.table
library(data.table)
setDT(trainall)[, g := .GRP, .(member, grp = cumsum(c(FALSE, diff(time) != 1)))]
trainall
# time member g
# 1: 1951-01-01 1 1
# 2: 1951-01-02 1 1
# 3: 1951-01-03 1 1
# 4: 1951-01-04 1 1
# 5: 1953-03-03 1 2
# 6: 1953-03-04 1 2
# 7: 1953-03-05 1 2
# 8: 1953-03-06 1 2
# 9: 1951-01-02 2 3
#10: 1951-01-03 2 3
#11: 1951-01-04 2 3
#12: 1951-01-05 2 3
#13: 1953-03-13 2 4
#14: 1953-03-14 2 4
#15: 1953-03-15 2 4
#16: 1953-03-16 2 4
#17: 1951-05-01 3 5
#18: 1951-05-02 3 5
#19: 1951-05-03 3 5
#20: 1951-05-04 3 5
#21: 1953-03-04 3 6
#22: 1953-03-05 3 6
#23: 1953-03-06 3 6
#24: 1953-03-07 3 6
I have two dataframes a and b which I would like to combine
a <- data.frame(g=c("1","2","2","3","3","3","4","4","4","4"),h=c("1","1","2","1","2","3","1","2","3","4"))
b <- data.frame(g=c("1","2","3","3","3","4","4","4","4","4"),i=c("1","2","3","2","1","2","3","4","5","6"))
g represents a grouping variable and h and i the columns I want to merge/join
> a
g h
1 1 1
2 2 1
3 2 2
4 3 1
5 3 2
6 3 3
7 4 1
8 4 2
9 4 3
10 4 4
> b
g i
1 1 1
2 2 2
3 3 3
4 3 2
5 3 1
6 4 2
7 4 3
8 4 4
9 4 5
10 4 6
a and b should be merged on the level of the grouping variable g whereas identical values of h and i should be put together (independant of the order they appear in h/i) and not identical values should be combined once (not all possible combinations).
a final df would look like:
g h i
1 1 1 1
2 2 1 <NA>
3 2 2 2
4 3 1 1
5 3 2 2
6 3 3 3
7 4 1 <NA>
8 4 2 2
9 4 3 3
10 4 4 4
11 4 <NA> 5
12 4 <NA> 6
I need that df to perform a correlation analysis.
Sounds like a merge on h==i, while retaining i, so create a new variable x to join on, and keep join results from both sides (all=TRUE). With a large hat-tip to #Moody_Mudskipper:
merge(transform(a,x=h), transform(b,x=i), all=TRUE)
# g x h i
#1 1 1 1 1
#2 2 1 1 <NA>
#3 2 2 2 2
#4 3 1 1 1
#5 3 2 2 2
#6 3 3 3 3
#7 4 1 1 <NA>
#8 4 2 2 2
#9 4 3 3 3
#10 4 4 4 4
#11 4 5 <NA> 5
#12 4 6 <NA> 6
We can also do this with dplyr
library(dplyr)
a %>%
mutate(x = h) %>%
full_join(mutate(b, x = i)) %>%
select(-x)