r - dplyr full_join using column position - r

I the following dataframes:
a <- c(1,1,1)
b<- c(10,8,2)
c<- c(2,2)
d<- c(3,5)
AB<- data.frame(a,b)
CD<- data.frame(c,d)
I would like to join AB and CD, where the first column of CD is equal to the second column of AB. Please note that my actual data will have a varying number of columns, with varying names, so I am really looking for a way to join based on position only. I have been trying this:
#Get the name of the last column in AB
> colnames(AB)[ncol(AB)]
[1] "b"
#Get the name of the first column in CD
> colnames(CD)[1]
[1] "c"
Then I attempt to join like this:
> abcd <- full_join(AB, CD, by = c(colnames(AB)[ncol(AB)]=colnames(CD)[1]))
Error: unexpected '=' in "abcd <- full_join(AB, CD, by = c(colnames(AB)[ncol(AB)]="
The behavior I am looking for is essentially this:
> abcd<- full_join(AB, CD, by = c("b" = "c"))
> abcd
a b d
1 1 10 NA
2 1 8 NA
3 1 2 3
4 1 2 5

We can do setNames
full_join(AB, CD, setNames(colnames(CD)[1], colnames(AB)[ncol(AB)]))
# a b d
#1 1 10 NA
#2 1 8 NA
#3 1 2 3
#4 1 2 5

We can replace the target column names with a common name, such as "Target", and then do full_join. Finally, replace the "Target" name with the original column name.
library(dplyr)
AB_name <- names(AB)
target_name <- AB_name[ncol(AB)] # Store the original column name
AB_name[ncol(AB)] <- "Target" # Set a common name
names(AB) <- AB_name
CD_name <- names(CD)
CD_name[1] <- "Target" # Set a common name
names(CD) <- CD_name
abcd <- full_join(AB, CD, by = "Target") %>% # Merge based on the common name
rename(!!target_name := Target) # Replace the common name with the original name
abcd
# a b d
# 1 1 10 NA
# 2 1 8 NA
# 3 1 2 3
# 4 1 2 5

Related

Add new variable to specific position in dataframe without specifying a numbered position

In Stata, I can create a variable after or before another one. E.g. gen age=., after(sex)
I would like to do the same in R. Is it possible?
My database has 300 variables, so I don't want to count it to discover its numbered position and also I might change from time to time.
You could do:
library(tibble)
data <- data.frame(a = c(1,2,3), b = c(1,2,3), c = c(1,2,3))
add_column(data, d = "", .after = "b")
# a b d c
# 1 1 1
# 2 2 2
# 3 3 3
Or another way could be:
data.frame(append(data, list(d = ""), after = match("b", names(data))))
First add the new column to the end of your data frame. Then, find the index of the column after which you want that new column to actually appear, and interpolate it:
df$new_col <- ...
index <- match("col_before", names(df))
df <- df[, c(names(df)[c(1:index)], "new_col", names(df)[c((index+1):(ncol(df)-1))])]
Sample:
df <- data.frame(v1=c(1:3), v2=c(4:6), v3=c(7:9))
df$new_col <- c(7,7,7)
index <- match("v2", names(df))
df <- df[, c(names(df)[c(1:index)], "new_col", names(df)[c((index+1):(ncol(df)-1))])]
df
v1 v2 new_col v3
1 1 4 7 7
2 2 5 7 8
3 3 6 7 9

How to fix the return of the matched elements in r

I have two data frame and I want to match them together and return the element back but the problem I got is if there is a match it returns the first element, not the matching element.
I tried this code but it returns the index of the element
data3 <- data1 %>%
mutate(score = lapply(M,function(x){ifelse (f<-(match(x, data2$words)), f )}))
Term score
1 I NA
2 A B 3, 7
3 A 3
4 Z NA
5 D 4
6 B 7
here is the code :
data1 <- data.frame(Term = c("I","A B","A","Z","D","B"))#txt
library(stringr)
M<-str_split(data1$Term , pattern = "\\s+")
data2 <- data.frame(words = c("O","C","A","D","E","F","B"))#dec
data3 <- data1 %>% mutate(score = lapply(M,function(x){ifelse (match(x, data2$words), data2$words )}))```
#here is the result I got
Term score
1 I NA
2 A B O, C
3 A O
4 Z NA
5 D O
6 B O
#and as you can see if there is a match it returns the first element.
#the result I expected
Term score
1 I NA
2 A B A, B
3 A A
4 Z NA
5 D D
6 B B
After getting the index, use that to get the corresponding 'words' from 'data2', paste the elements into a single string (toString) on the non-NA elements and change the blank ("") elements to NA with dplyr::na_if
data1$score <- sapply(M, function(x) {
x1 <- data2$words[match(x, data2$words)]
dplyr::na_if(toString(na.omit(x1)), "")})
data1$score
#[1] NA "A, B" "A" NA "D" "B"

Take sum of rows for every 3 columns in a dataframe

I have searched high and low and also tried multiple options to solve this but did not get the desired output as mentioned below:
I have dataframe df3 with headers as date and values beteween 0-1 as shown below:
df = data.frame(replicate(6,sample(0:1,6,rep=TRUE)))
colnames(df) = c("1/1/2018","1/2/2018","1/3/2018","1/4/2018","1/5/2018","1/6/2018")
df2 = data.frame(c("A","B","C","D","E","F"))
colnames(df2) = c("CUST_ID")
df3 = cbind(df2,df)
Now I need df4 in which sum of first 3 columns in series will form one column. This will be repeated in series for rest of the columns dynamically.
df4
Options I tried:
a) rbind.data.frame(apply(matrix(df3, nrow = n - 1), 1,sum))
b) col_list <- list(c("1/1/2018","1/2/2018","1/3/2018"), c("1/4/2018","1/5/2018","1/6/2018"))
lapply(col_list, function(x)sum(df3[,x])) %>% data.frame
One way would be to split df3 every 3 columns using split.default. To split the data we generate a sequence using rep, then for each dataframe we take rowSums and finally cbind the result together.
cbind(df3[1], sapply(split.default(df3[-1],
rep(1:ncol(df3), each = 3, length.out = (ncol(df3) -1))), rowSums))
# CUST_ID 1 2
#1 A 1 1
#2 B 2 0
#3 C 2 1
#4 D 1 1
#5 E 2 2
#6 F 2 2
FYI, the sequence generated from rep is
rep(1:ncol(df3), each = 3, length.out = (ncol(df3) -1))
#[1] 1 1 1 2 2 2
This makes it possible to split every 3 columns.
The results are different because OP used sample without set.seed.
If rep seems too long then we can generate the same sequence of columns using gl
gl(ncol(df3[-1])/3, 3)
#[1] 1 1 1 2 2 2
#Levels: 1 2
So the final code, would be
cbind(df3[1], sapply(split.default(df3[-1], gl(ncol(df3[-1])/3, 3)), rowSums))
We can use seq to create index, get the subset of columns within in a list, Reduce by taking the sum, and create new columns
df4 <- df3[1]
df4[paste0('col', c('123', '456'))] <- lapply(seq(2, ncol(df3), by = 3),
function(i) Reduce(`+`, df3[i:min((i+2), ncol(df3))]))
df4
# CUST_ID col123 col456
#1 A 2 2
#2 B 3 3
#3 C 1 3
#4 D 2 3
#5 E 2 1
#6 F 0 1
data
set.seed(123)
df <- data.frame(replicate(6,sample(0:1,6,rep=TRUE)))
colnames(df) <- c("1/1/2018","1/2/2018","1/3/2018","1/4/2018","1/5/2018","1/6/2018")
df2 <- data.frame(c("A","B","C","D","E","F"))
colnames(df2) = c("CUST_ID")
df3 <- cbind(df2, df)

How to use merge or replace to update a table in R with multiple columns

I want to do something VERY similar to this question: how to use merge() to update a table in R
but instead of just one column being the index, I want to match the new values on an arbitrary number of columns >=1.
foo <- data.frame(index1=c('a', 'b', 'b', 'd','e'),index2=c(1, 1, 2, 3, 2), value=c(100,NA, 101, NA, NA))
Which has the following values
foo
index1 index2 value
1 a 1 100
2 b 1 NA
3 b 2 101
4 d 3 NA
5 e 2 NA
And the data frame bar
bar <- data.frame(index1=c('b', 'd'),index2=c(1,3), value=c(200, 201))
Which has the following values:
bar
index1 index2 value
1 b 1 200
2 d 3 201
merge(foo, bar, by='index', all=T)
It results in this output:
Desired output:
foo
index1 index2 value
1 a 1 100
2 b 1 200
3 b 2 101
4 d 3 201
5 e 2 NA
I think you don't need a merge but more to rbind and filter them later. Here I am using data.table for sugar syntax.
dx <- rbind(bar,foo)
library(data.table)
setDT(dx)
## note this can be applied to any number of index
setkeyv(dx,grep("index",names(dx),v=T))
## using unqiue to remove all duplicated
## here it will remove the duplicated with missing values which is the
## expected behavior
unique(dx)
# index1 index2 value
# 1: b 1 200
# 2: b 2 101
# 3: d 3 201
# 4: a 1 100
# 5: e 2 NA
you can be more explicit and filter your rows by group of indexs:
dx[,ifelse(length(value)>1,value[!is.na(value)],value),key(dx)]
Here's an R base approach
> temp <- merge(foo, bar, by=c("index1","index2"), all=TRUE)
> temp$value <- with(temp, ifelse(is.na(value.x) & is.na(value.y), NA, rowSums(temp[,3:4], na.rm=TRUE)))
> temp <- temp[, -c(3,4)]
> temp
index1 index2 value
1 a 1 100
2 b 1 200
3 b 2 101
4 d 3 201
5 e 2 NA
You can use some dplyr voodoo to produce what you want. The following subsets the data by unique combinations of "index1" and "index2", and checks the contents of "value" for each subset. If "value" has any non-NA values, those are returned. If only an NA value is found, that is returned.
Seems a little specific, but it seems to do what you want!
library(dplyr)
df.merged <- merge(foo, bar, all = T) %>%
group_by(index1, index2) %>%
do(
if (any(!is.na(.$value))) {
return(subset(., !is.na(value)))
} else {
return(.)
}
)
Output:
index1 index2 value
<fctr> <fctr> <dbl>
1 a 1 100
2 b 1 200
3 b 2 101
4 d 3 201
5 e 2 NA
You can specify as many columns as you want with merge:
out <- merge(foo, bar, by=c("index1", "index2"), all.x=TRUE)
new <- apply(out[,3:4], 1, function(x) sum(x, na.rm=TRUE))
new <- ifelse(is.na(out[,3]) & is.na(out[,4]), NA, new)
out <- cbind(out[,1:2], new)

Changing the values of a column for the values from another column

I have two datasets that look like this:
What I want is to change the values from the second column in the first dataset to the values from the second column from the second dataset. All the names in the first dataset are in the second one, and obviously my dataset is much bigger than that.
I was trying to use R to do that but I am very new at it. I was looking at the intersect command but I am not sure if it's going to work. I don't put any codes because I'm real lost here.
I also need that the order of the first columns (which are names) in the first dataset stays the same, but with the new values from the second column of the second dataset.
Agree with #agstudy, a simple use of merge would do the trick. Try something like this:
df1 <- data.frame(name=c("ab23242", "ab35366", "ab47490", "ab59614"),
X=c(72722, 88283, 99999, 114278.333))
df2 <- data.frame(name=c("ab35366", "ab47490", "ab59614", "ab23242" ),
X=c(12345, 23456, 34567, 456789))
df.merge <- merge(df1, df2, by="name", all.x=T)
df.merge <- df.merge[, -2]
Output:
name X.y
1 ab23242 456789
2 ab35366 12345
3 ab47490 23456
4 ab59614 34567
I think merge will keep order of first frame but you can also keep the order strictly by simply adding a column with order df1$order <- 1:nrow(df1) and later on sorting based on that column.
df1<- data.frame( name1 = letters[6:10], valuecol1=seq(2,10,by=2))
df2 <- data.frame( name2 = letters[1:10], valuecol2=10:1)
df2 [ match(df1$name1, df2$name2) , "valuecol2"] <- df1[ df1$name1 %in% df2$name2 , "valuecol1"]
df2
name2 valuecol2
1 a 10
2 b 9
3 c 8
4 d 7
5 e 6
6 f 2
7 g 4
8 h 6
9 i 8
10 j 10
This is what I thought might work, but doing replacements using indexing with match sometimes bites me in ways I need to adjust:
df2 [match(df1$name1, df2$name2) , "valuecol2"] <-
df1[ match(df1$name1, df2$name2) , "valuecol1"]
Here's how I tested it (edited).
> df2 <- data.frame( name2 = letters[1:10], valuecol2=10:1)
> df1<- data.frame( name1 = letters[1:5], valuecol1=seq(2,10,by=2))
> df2 [ match(df1$name1, df2$name2) , "valuecol2"] <- df1[ match(df1$name1, df2$name2) , "valuecol1"]
> df2
name2 valuecol2
1 a 2
2 b 4
3 c 6
4 d 8
5 e 10
6 f 5
7 g 4
8 h 3
9 i 2
10 j 1
Yep.... bitten again.
> df1<- data.frame( name1 = letters[6:10], valuecol1=seq(2,10,by=2))
> df2 [ match(df1$name1, df2$name2) , "valuecol2"] <- df1[ match(df1$name1, df2$name2) , "valuecol1"]
> df2
name2 valuecol2
1 a 2
2 b 4
3 c 6
4 d 8
5 e 10
6 f NA
7 g NA
8 h NA
9 i NA
10 j NA
How about this:
library(data.table)
# generate some random data
dt.1 <- data.table(id = 1:1000, value=rnorm(1000), key="id")
dt.2 <- data.table(id = 2*(500:1), value=as.numeric(1:500), key="id")
# objective is to replace value in df.1 with value from df.2 where id's match.
# data table joins - very efficient
# dt.1 now has 3 columns: id, value, and value.1 from dt.2$value
dt.1 <-dt.2[dt.1,nomatch=NA]
dt.1[is.na(value),]$value=dt.1[is.na(value),]$value.1
dt.1$value.1=NULL # get rid of extra column
NB: This sorts dt.1 by id which should be OK since it's sorted that way already.
Also: In future, please include data that can be imported into R. Images are not useful!

Resources