Convert values of a column based on another data frame in R [duplicate]

This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 6 years ago.
I have a data.frame
df1=data.frame(f=LETTERS[1:4],v=c(1:4))
f v
1 A 1
2 B 2
3 C 3
4 D 4
The first column is a factor. I have another data frame that maps these values to other values, which are also factors:
df2=data.frame(f=LETTERS[1:7],f2=letters[26:20])
f f2
1 A z
2 B y
3 C x
4 D w
5 E v
6 F u
7 G t
I am wondering how to write a function so that I can alter the values from the first column of df1 to what they map to from df2. I would like to get:
f v
1 z 1
2 y 2
3 x 3
4 w 4
I tried a for loop with no success. Any suggestions are greatly appreciated.
Note: this is a simplified example of my work. A merge would add too many columns to work with and I don't think the extra memory storage would be very useful

We can use match to find the position of each df1$f value in df2$f and take the corresponding df2$f2:
df1$f <- df2$f2[match(df1$f, df2$f)]
df1
# f v
#1 z 1
#2 y 2
#3 x 3
#4 w 4
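The same lookup idea can also be sketched with a named vector used as a dictionary. This is only an illustration, not part of the original answer: the name `lookup` is hypothetical, and `stringsAsFactors = FALSE` is an assumption made so that indexing happens by name rather than by factor codes.

```r
df1 <- data.frame(f = LETTERS[1:4], v = 1:4, stringsAsFactors = FALSE)
df2 <- data.frame(f = LETTERS[1:7], f2 = letters[26:20], stringsAsFactors = FALSE)

# Named vector acting as a dictionary: names are the keys, elements the values.
lookup <- setNames(df2$f2, df2$f)

# Index the dictionary by key; unname() drops the key names carried along.
df1$f <- unname(lookup[df1$f])
df1
```

As with match, no extra columns are added to df1, and a key that is missing from df2 would come back as NA.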

You can use merge
merge(df1,df2,by = "f")[,c(1,3,2)]
f f2 v
1 A z 1
2 B y 2
3 C x 3
4 D w 4

With dplyr, a left join keeps every row of df1 and adds the matching f2 column:
library(dplyr)
left_join(df1, df2, by = "f")

You could try using the merge function to merge the two tables, then specify which columns you want to keep.
For example:
df1 <- data.frame(f=LETTERS[1:4],v=c(1:4))
df2 <- data.frame(f=LETTERS[1:7],f2=letters[26:20])
merge(df1, df2, by = "f")[, c("f2", "v")]
f2 v
1 z 1
2 y 2
3 x 3
4 w 4

Related

How to keep previous row while removing duplicate from R dataframe [duplicate]

This question already has answers here:
Dropping common rows in two different dataframes
(3 answers)
Closed 4 years ago.
I have the two data frames shown below:
DF1
T1 ID Type
1 A L
2 B Y
3 C B
4 D U
5 E Z
DF2
T1 ID Type
1 A L
2 B Y
3 F K
4 G I
5 H T
Now I want to combine DF1 and DF2 so that every row in New_Data is unique, based on the ID column of both data frames.
Required Dataframe:
New_Data
T1 ID Type
1 A L
2 B Y
3 C B
4 D U
5 E Z
3 F K
4 G I
5 H T
I think you can just use
unique(rbind(DF1,DF2))
Row-bind the two data frames, then drop duplicates based on the ID column or on the ID + Type columns (when a key is duplicated, the row from the later data frame in bind_rows is the one dropped):
library(dplyr)
bind_rows(DF1, DF2) %>% distinct(ID, Type, .keep_all = TRUE)
# T1 ID Type
#1 1 A L
#2 2 B Y
#3 3 C B
#4 4 D U
#5 5 E Z
#6 3 F K
#7 4 G I
#8 5 H T
Based on ID column only:
bind_rows(DF1, DF2) %>% distinct(ID, .keep_all = TRUE)
# T1 ID Type
#1 1 A L
#2 2 B Y
#3 3 C B
#4 4 D U
#5 5 E Z
#6 3 F K
#7 4 G I
#8 5 H T
I'm not sure if this is exactly what you wanted, but to combine the dataframes, you can use the merge function:
# merge two data frames by ID
New_Data <- merge(DF1, DF2 ,by="ID", all=TRUE)
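The all = TRUE argument means that New_Data gets a row for every ID that appears in DF1 or in DF2 (a full outer join), and the merge does not duplicate rows. Note that because T1 and Type exist in both data frames and are not join keys, they come back as suffixed columns (T1.x, Type.x, T1.y, Type.y). For further information, I suggest looking up inner and outer joins as well as the documentation for the merge function.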
Edit: Binding the rows will also work if you don't want to deal with merging. A row bind stacks one data frame vertically on top of the other. To order the stacked data alphabetically by ID, you could try:
New_Data <- unique(rbind(DF1, DF2))
New_Data <- New_Data[order(New_Data$ID), ]
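Note that unique() compares whole rows, so two rows with the same ID but different T1 or Type would both survive. A minimal base R sketch that deduplicates on ID alone is shown below; the intermediate name `stacked` is hypothetical, and the data frames are rebuilt from the question's printed tables.

```r
DF1 <- data.frame(T1 = 1:5, ID = c("A", "B", "C", "D", "E"),
                  Type = c("L", "Y", "B", "U", "Z"), stringsAsFactors = FALSE)
DF2 <- data.frame(T1 = 1:5, ID = c("A", "B", "F", "G", "H"),
                  Type = c("L", "Y", "K", "I", "T"), stringsAsFactors = FALSE)

# Stack DF1 on top of DF2, then keep only the first occurrence of each ID.
stacked <- rbind(DF1, DF2)
New_Data <- stacked[!duplicated(stacked$ID), ]

# Reset the row names so they run 1..n again.
rownames(New_Data) <- NULL
New_Data
```

Because DF1 comes first in rbind, its rows win whenever an ID occurs in both data frames.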

join/merge data frames in R [duplicate]

This question already has answers here:
Merge dataframes of different sizes
(4 answers)
Left join using data.table
(3 answers)
Closed 5 years ago.
I would like to join similar data frames:
input:
x <- data_frame(a=c(1,2,3,4),b=c(4,5,6,7),c=c(1,NA,NA,NA))
y <- data_frame(a=c(2,3),b=c(5,6),c=c(1,2))
desired output:
z <- data_frame(a=c(1,2,3,4),b=c(4,5,6,7),c=c(1,1,2,NA))
I tried
x <- data_frame(a=c(1,2,3,4),b=c(4,5,6,7),c=c(1,NA,NA,NA))
y <- data_frame(a=c(2,3),b=c(5,6),c=c(1,2))
z <- merge(x,y, all=TRUE)
but it has one inconvenience:
a b c
1 1 4 1
2 2 5 1
3 2 5 NA
4 3 6 2
5 3 6 NA
6 4 7 NA
It doubles rows where there are similarities. Is there a way to get desired output without deleting unwanted rows?
EDIT
I cannot simply delete the rows with NA: the x data frame contains rows with NA that have no match in the y data frame. If I did that, I would delete the 4th row of x (4 7 NA).
Thanks for help
You can use an update join with the data.table package:
# load the package and convert the data frames to data.tables
library(data.table)
setDT(x)
setDT(y)
# update join
x[y, on = .(a, b), c := i.c][]
which gives:
a b c
1: 1 4 1
2: 2 5 1
3: 3 6 2
4: 4 7 NA
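If data.table is not available, the same "update in place" idea can be sketched in base R by matching on a composite key. This is an illustrative alternative rather than the answer's method; the names `key_x`, `key_y`, `idx`, and `hit` are hypothetical, and plain data.frame is used instead of data_frame.

```r
x <- data.frame(a = c(1, 2, 3, 4), b = c(4, 5, 6, 7), c = c(1, NA, NA, NA))
y <- data.frame(a = c(2, 3), b = c(5, 6), c = c(1, 2))

# Build one key per row from the join columns a and b.
key_x <- paste(x$a, x$b, sep = "_")
key_y <- paste(y$a, y$b, sep = "_")

# Position of each x row in y, or NA when y has no matching row.
idx <- match(key_x, key_y)

# Overwrite c only for the rows of x that have a match in y.
hit <- !is.na(idx)
x$c[hit] <- y$c[idx[hit]]
x
```

Rows of x without a match keep their original c value, so the row (4, 7, NA) is preserved and no rows are duplicated.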

Selecting rows from a data frame from combinations of lists [duplicate]

This question already has answers here:
Removing one table from another in R [closed]
(3 answers)
Closed 5 years ago.
I have a dataframe, dat:
dat<-data.frame(col1=rep(1:4,3),
col2=rep(letters[24:26],4),
col3=letters[1:12])
I want to filter dat on two different columns using ONLY the combinations given by the rows in the data frame filter:
filter<-data.frame(col1=1:3,col2=NA)
lists<-list(list("x","y"),list("y","z"),list("x","z"))
filter$col2<-lists
So for example, rows containing (1,x) and (1,y), would be selected, but not (1,z),(2,x), or (3,y).
I know how I would do it using a for loop:
#create an empty frame (same columns as dat) to collect results in
results<-dat[0,]
for(f in 1:nrow(filter)){
temp_filter<-filter[f,]
temp_dat<-dat[dat$col1==temp_filter[1,1] &
dat$col2%in%unlist(temp_filter[1,2]),]
results<-rbind(results,temp_dat)
}
Or if you prefer dplyr style:
require(dplyr)
results<-dat[0,]
for(f in 1:nrow(filter)){
temp_filter<-filter[f,]
temp_dat<-filter(dat,col1==temp_filter[1,1] &
col2%in%unlist(temp_filter[1,2]))
results<-rbind(results,temp_dat)
}
results should return
col1 col2 col3
1 1 x a
5 1 y e
2 2 y b
6 2 z f
3 3 z c
7 3 x g
I would normally do the filtering using a merge, but I can't now since I have to check col2 against a list rather than a single value. The for loop works but I figured there would be a more efficient way to do this, probably using some variation of apply or do.call.
We could use dplyr::anti_join() to do the row exclusion filtering for us, if we had two dataframes:
index <- data.frame(col1 = as.character(filter[,1]),
col2 = filter[,2])
anti_join(dat, index)
Joining, by = c("col1", "col2")
col1 col2 col3
1 4 x d
2 1 y e
3 2 z f
4 3 x g
5 4 y h
6 1 z i
7 2 x j
8 3 y k
9 4 z l
mostly base with a little help from dplyr:
dplyr::setdiff(dat,merge(dat,setNames(as.data.frame(filter),names(dat)[1:2])))
col1 col2 col3
1 4 x d
2 1 y e
3 2 z f
4 3 x g
5 4 y h
6 1 z i
7 2 x j
8 3 y k
9 4 z l
A real base R solution though not so pretty and you lose the row order:
subset(merge(dat,`[[<-`(setNames(as.data.frame(filter),names(dat)[1:2]),"x",value=1),all.x=T),is.na(x),-4)
col1 col2 col3
2 1 y e
3 1 z i
4 2 x j
6 2 z f
7 3 x g
8 3 y k
10 4 x d
11 4 y h
12 4 z l
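For the selection described in the question (keeping only the listed combinations, with several allowed col2 values per col1), one loop-free sketch is to expand the list column into one row per allowed pair and test membership. The names `keys`, `allowed`, `pairs`, and `keep` are hypothetical, and `stringsAsFactors = FALSE` is an assumption to keep col2 as character.

```r
dat <- data.frame(col1 = rep(1:4, 3),
                  col2 = rep(letters[24:26], 4),
                  col3 = letters[1:12],
                  stringsAsFactors = FALSE)

# Allowed col2 values for each col1 value, mirroring the question's list column.
keys <- 1:3
allowed <- list(c("x", "y"), c("y", "z"), c("x", "z"))

# Expand to one row per allowed (col1, col2) pair.
pairs <- data.frame(col1 = rep(keys, lengths(allowed)),
                    col2 = unlist(allowed),
                    stringsAsFactors = FALSE)

# Keep the rows of dat whose (col1, col2) pair appears in pairs.
keep <- paste(dat$col1, dat$col2) %in% paste(pairs$col1, pairs$col2)
results <- dat[keep, ]
results
```

Once the pairs are in this long form, merge(dat, pairs) would also work as the usual join-based filter, at the cost of reordering the rows.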

Create a new dataframe according to the contrast between two similar df [duplicate]

This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 5 years ago.
I have a dataframe made like this:
X Y Z T
1 2 4 2
3 2 1 4
7 5 NA 3
After several steps (which ones is not important) I obtained this data frame:
X Y Z T
1 2 4 2
3 2 NA 4
7 5 NA 3
I want to obtain a new data frame containing only the rows that did not change during those steps; the result would be this one:
X Y Z T
1 2 4 2
7 5 NA 3
How can I do this?
One option with base R would be to paste the rows of each dataset together and compare (==) to create a logical vector which we use for subsetting the new dataset
dfO[do.call(paste, dfO) == do.call(paste, df),]
# X Y Z T
#1 1 2 4 2
#3 7 5 NA 3
where 'dfO' is the old dataset and 'df' is the new
You can use dplyr's intersect function:
library(dplyr)
intersect(d1, d2)
# X Y Z T
#1 1 2 4 2
#2 7 5 NA 3
This is a data.frame-equivalent of base R's intersect function.
In case you're working with data.tables, that package also provides such a function:
library(data.table)
setDT(d1)
setDT(d2)
fintersect(d1, d2)
# X Y Z T
#1: 1 2 4 2
#2: 7 5 NA 3
Another dplyr solution: semi_join.
dt1 %>% semi_join(dt2, by = colnames(.))
X Y Z T
1 1 2 4 2
2 7 5 NA 3
Data
dt1 <- read.table(text = "X Y Z T
1 2 4 2
3 2 1 4
7 5 NA 3",
header = TRUE, stringsAsFactors = FALSE)
dt2 <- read.table(text = " X Y Z T
1 2 4 2
3 2 NA 4
7 5 NA 3",
header = TRUE, stringsAsFactors = FALSE)
I am afraid that semi_join, intersect, and merge are not correct answers here. merge and intersect will not handle duplicate rows properly, and semi_join will change the order of the rows.
From this perspective, I think the only correct one so far is akrun's.
You could also do something like:
df1[rowSums((df1 == df2) | (is.na(df1) & is.na(df2)), na.rm = TRUE) == ncol(df1), ]
But I think akrun's way is more elegant and likely to perform better in terms of speed.
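The cell-by-cell comparison can be spelled out step by step on the question's data. This is a sketch under the assumption that both data frames have the same shape and row order; the names `old`, `new`, `same`, and `unchanged` are hypothetical.

```r
old <- data.frame(X = c(1, 3, 7), Y = c(2, 2, 5), Z = c(4, 1, NA), T = c(2, 4, 3))
new <- data.frame(X = c(1, 3, 7), Y = c(2, 2, 5), Z = c(4, NA, NA), T = c(2, 4, 3))

# A cell counts as unchanged when both values are equal or both are NA.
same <- (old == new) | (is.na(old) & is.na(new))

# A comparison where exactly one side is NA yields NA; treat it as changed.
same[is.na(same)] <- FALSE

# Keep the rows in which every cell is unchanged.
unchanged <- old[rowSums(same) == ncol(old), ]
unchanged
```

Unlike a set-style intersect, this compares rows position by position, so duplicate rows and row order are left as they are.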

Picking up only specific columns based on conditions on multiple columns in R [duplicate]

This question already has answers here:
How to select the rows with maximum values in each group with dplyr? [duplicate]
(6 answers)
Closed 6 years ago.
I have a data frame, say
df <- data.frame(x = c(1,2,5,6,3,3,3,6,8,8,8,8),
y = c(1,1,1,1,1,2,3,1,1,2,3,4),
z = c("a","b","c","d","e","f","g","h","i","j","k","l"))
it looks like this
x y z
1 1 1 a
2 2 1 b
3 5 1 c
4 6 1 d
5 3 1 e
6 3 2 f
7 3 3 g
8 6 1 h
9 8 1 i
10 8 2 j
11 8 3 k
12 8 4 l
I would like to pick rows for each unique value of x such that y is at its maximum within that value of x. For example, rows 5 to 7 all have x = 3, so I would like to pick the row with y = 3 (the maximum); similarly, for x = 8 I would like to pick the row with y = 4.
the output should look like this
x y z
1 1 1 a
2 2 1 b
3 5 1 c
4 6 1 d
5 3 3 g
6 6 1 h
7 8 4 l
I have a solution, which I am posting as an answer, but it only works in this specific case (picking the largest value). Is there a better method, and what is the general solution for this?
One solution using dplyr
library(dplyr)
df %>%
group_by(x) %>%
slice(which.max(y)) # which.max gives the position of the largest y within each group
# x y z
# (dbl) (dbl) (chr)
#1 1 1 a
#2 2 1 b
#3 3 3 g
#4 5 1 c
#5 6 1 d
#6 8 4 l
The base R alternative is aggregate, although it returns only the x and y columns (z is dropped):
aggregate(y~x, df, max)
You can achieve the same result using a dplyr chain and dplyr's group_by function. Once you use a group_by function the rest of the functions in the chain are applied within group as opposed to the whole data.frame. So here I filter to where the only rows left are the max(y) per the grouping value of x. This can be extended to be used for the min of y or a particular value.
I think it's generally good practice to ungroup the data at the end of a chain using group_by to avoid any unexpected behavior.
library(dplyr)
df <- data.frame(x = c(1,2,5,6,3,3,3,6,8,8,8,8),
y = c(1,1,1,1,1,2,3,1,1,2,3,4),
z = c("a","b","c","d","e","f","g","h","i","j","k","l"))
df %>%
group_by(x) %>%
filter(y==max(y)) %>%
ungroup()
To make it more general... say instead you wanted the mean of y for a given x as opposed to the max. You could then use the summarise function instead of the filter as shown below.
df %>%
group_by(x) %>%
summarise(y=mean(y)) %>%
ungroup()
Using data.table we can use df[order(z), .I[which.max(y)], by = x] to get the row numbers of interest, e.g.:
library(data.table)
setDT(df)
df[df[order(z), .I[which.max(y)], by = x][, V1]]
x y z
1: 1 1 a
2: 2 1 b
3: 5 1 c
4: 6 1 d
5: 3 3 g
6: 8 4 l
Here is my solution using dplyr package
library(dplyr)
df <- data.frame(x = c(1,2,5,6,3,3,3,6,8,8,8,8),
y = c(1,1,1,1,1,2,3,1,1,2,3,4),
z = c("a","b","c","d","e","f","g","h","i","j","k","l"))
df <- arrange(df,desc(y))
df_out <- df[!duplicated(df$x),]
df_out
Printing df_out
x y z
1 8 4 l
2 3 3 g
6 1 1 a
7 2 1 b
8 5 1 c
9 6 1 d
Assuming the rows are ordered by y within each value of x, as they are in the example (df[order(df$x, df$y),] guarantees this), you can use the base R functions split, lapply, and do.call/rbind to extract your desired rows using the "split / apply / combine" methodology.
do.call(rbind, lapply(split(df, df$x), function(i) i[nrow(i),]))
x y z
1 1 1 a
2 2 1 b
3 3 3 g
5 5 1 c
6 6 1 h
8 8 4 l
split breaks up the data.frame into a list based on x. This list is fed to lapply which selects the last row of each data.frame, and returns these one row data.frames as a list. This list is then rbinded into a single data frame using do.call.
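A base R variant that does not depend on row order and keeps ties (both x = 6 rows, as in the question's desired output) can be sketched with ave, which returns the per-group maximum aligned to the original rows. The name `best` is hypothetical, and `stringsAsFactors = FALSE` is an assumption to keep z as character.

```r
df <- data.frame(x = c(1, 2, 5, 6, 3, 3, 3, 6, 8, 8, 8, 8),
                 y = c(1, 1, 1, 1, 1, 2, 3, 1, 1, 2, 3, 4),
                 z = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l"),
                 stringsAsFactors = FALSE)

# ave() computes max(y) within each x and repeats it for every row of the group.
# Rows whose y equals their group maximum are kept, in the original order.
best <- df[df$y == ave(df$y, df$x, FUN = max), ]
best
```

Swapping FUN = max for FUN = min, or comparing against a fixed value, generalizes the selection in the same way as changing the condition inside filter.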

Resources