I have a dataframe, A, which looks like this:
col1 col2 col3
NL 6 9
UK 5 5
US 9 7
and I have a dataframe, B, consisting of a subset of the rows of the large dataframe looking like this:
col1 col2 col3
NL 6 9
UK 5 5
Now, I want to find the indices of the rows from B in A, so it should return 1 and 2. Does someone know how to do this?
EDIT
Next, I also want to find the indices of the rows in A when B contains only the first two columns. In that case it should also return 1 and 2. Does anyone have an idea how to do this?
Generally, match returns the index of the first match. Here, one approach is to paste the columns of each row together and look up the pasted strings with match (df1 is A and df2 is B)
match(do.call(paste, df2), do.call(paste, df1))
If the two datasets share only a subset of their column names, get the common column names with intersect, subset both datasets to those columns, paste, and get the index with match
nm1 <- intersect(names(df1), names(df2))
match(do.call(paste, df2[nm1]), do.call(paste, df1[nm1]))
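As a minimal sketch, the two match calls can be run on data rebuilt from the question. The data frame construction below is an assumption (the question only shows printed tables), with df1 standing in for A and df2 for B. The last call is an optional variant that is not part of the answer: passing an explicit sep to paste lowers the risk that two different rows paste to the same string.

```r
# Assumed reconstruction of the question's data: df1 is A, df2 is B
df1 <- data.frame(col1 = c("NL", "UK", "US"),
                  col2 = c(6, 5, 9),
                  col3 = c(9, 5, 7),
                  stringsAsFactors = FALSE)
df2 <- df1[1:2, ]

# Case 1: B has all columns of A
idx_all <- match(do.call(paste, df2), do.call(paste, df1))

# Case 2: B has only the first two columns (df2_small is a hypothetical name)
df2_small <- df2[c("col1", "col2")]
nm1 <- intersect(names(df1), names(df2_small))
idx_sub <- match(do.call(paste, df2_small[nm1]), do.call(paste, df1[nm1]))

# Variant: an explicit separator that is unlikely to occur in the data
idx_sep <- match(do.call(paste, c(df2, sep = "\r")),
                 do.call(paste, c(df1, sep = "\r")))

print(idx_all)
print(idx_sub)
```

Note that match reports only the first matching row in df1, so duplicated rows in A would need a different approach.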
Another option is a join: create a row index in both datasets, join them, and extract the row index that came from df1
library(dplyr)
df2 %>%
  mutate(rn = row_number()) %>%
  left_join(df1 %>%
              mutate(rn = row_number()), by = c('col1', 'col2', 'col3')) %>%
  pull(rn.y)
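The same row-index idea can be sketched without dplyr. This swaps in base R's merge() for the join, which is not what the answer uses, and rebuilds the question's data as an assumption. Only df1 gets a row index here, since that is the index being asked for.

```r
# Assumed reconstruction of the question's data: df1 is A, df2 is B
df1 <- data.frame(col1 = c("NL", "UK", "US"),
                  col2 = c(6, 5, 9),
                  col3 = c(9, 5, 7),
                  stringsAsFactors = FALSE)
df2 <- df1[1:2, ]

# Record each row's position in df1 before joining
df1$rn <- seq_len(nrow(df1))

# Inner join on the shared columns; rn carries the position in df1
merged <- merge(df2, df1, by = c("col1", "col2", "col3"))
print(merged$rn)
```

merge() sorts its result by the key columns by default, so the returned indices follow key order rather than the row order of df2.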
Related
I have two dataframes with one ID variable in the first df ("ID") and three in the second df ("SIC","Ur","Sonst"). Now I am trying to merge these two datasets by checking whether the "ID" in the first df matches SIC, Ur, or Sonst in the respective row. Here is my reproducible example:
df1 <- data.frame(ID = c("A", "B", "C","D"),
Value=c(1:4))
df2 <- data.frame(SIC = c("B", NA,NA,NA,NA,NA),
Ur = c(NA, "C", NA,NA,NA,NA),
Sonst=c(NA,NA,"A",NA,NA,NA),
Age=c(14:19))
Now I want a final df only with IDs and all information of the first df (as it is the target df) plus the corresponding age information, if ID either matches with SIC, Ur or Sonst. I have tried dplyr and merge function approaches but did not come up with a proper solution. I'm thankful for any suggestions.
An approach using dplyr and left_join with tidyr's unite
library(dplyr)
library(tidyr)
left_join(df1, df2 %>% unite("ID", SIC:Sonst, na.rm=T))
Joining with `by = join_by(ID)`
ID Value Age
1 A 1 16
2 B 2 14
3 C 3 15
4 D 4 NA
or an inner_join if you only want A, B and C to show up
inner_join(df1, df2 %>% unite("ID", SIC:Sonst, na.rm=T))
Joining with `by = join_by(ID)`
ID Value Age
1 A 1 16
2 B 2 14
3 C 3 15
The most convenient way is perhaps the tidyverse approach, as nicely shown by Andre Wildberg. You can also do it with the base R merge() function, but in your case you first need nested ifelse() calls to put the non-missing values of the three columns from df2 into a single column:
df2$ID <- ifelse(!is.na(df2$SIC), df2$SIC,
ifelse(!is.na(df2$Ur), df2$Ur, df2$Sonst))
Merge the two data frames:
df3 <- merge(df1, df2, by= "ID", all.x = TRUE)
Discard unwanted columns from merged data (df3):
df3 <- df3[, c("ID", "Value", "Age")]
df3
ID Value Age
A 1 16
B 2 14
C 3 15
D 4 NA
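If there are more than three candidate columns, nesting ifelse() gets unwieldy. A hedged sketch of one alternative is to fold over the columns with Reduce(), keeping the first non-missing value in each row. The name id_cols is a hypothetical helper, and the data is copied from the question with stringsAsFactors = FALSE added so that the columns are character on any R version.

```r
df1 <- data.frame(ID = c("A", "B", "C", "D"),
                  Value = 1:4,
                  stringsAsFactors = FALSE)
df2 <- data.frame(SIC   = c("B", NA, NA, NA, NA, NA),
                  Ur    = c(NA, "C", NA, NA, NA, NA),
                  Sonst = c(NA, NA, "A", NA, NA, NA),
                  Age   = 14:19,
                  stringsAsFactors = FALSE)

# Columns to search, in priority order (hypothetical helper name)
id_cols <- c("SIC", "Ur", "Sonst")

# Row-wise "first non-NA" across the candidate columns
df2$ID <- Reduce(function(a, b) ifelse(is.na(a), b, a), df2[id_cols])

# Keep every row of df1 and attach Age where an ID was found
df3 <- merge(df1, df2[!is.na(df2$ID), c("ID", "Age")], by = "ID", all.x = TRUE)
print(df3)
```

Dropping the rows of df2 without any ID before merging is a design choice here; it keeps unmatched NA keys out of the join.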
I have two dataframes: 1) an old dataframe (let's call it "df1") and 2) an updated dataframe ("df2"). I need to identify what has been added to or removed from df1 to create df2. So, I need a new dataframe with a new column identifying what rows should be added to or removed from df1 in order to get df2.
The two dataframes are of differing lengths, and Vessel_ID is the only unique identifier.
Here is a reproducible example:
df1 <- data.frame(Name=c('Vessel1', 'Vessel2', 'Vessel3', 'Vessel4', 'Vessel5'),
Vessel_ID=c('1','2','3','4','5'), special_NO=c(10,20,30,40,50),
stringsAsFactors=F)
df2 <- data.frame(Name=c('Vessel1', 'x', 'y', 'Vessel3', 'x', 'Vessel6'), Vessel_ID=c('1', '6', '7', '3', '5', '10'), special_NO=NA, stringsAsFactors=F)
Ideally I would want an output like this:
df3
Name Vessel_ID special_NO add_remove
Vessel2 2 20 remove
Vessel4 4 40 remove
Vessel6 10 NA add
x 6 NA add
y 7 NA add
Also, if the Vessel_ID matches, I want to substitute the special_NO from df1 for NA in df2...but maybe that's for another question.
I tried adding a new column to both df1 and df2 to identify which df they originally belonged to, then merging the dataframes and using the duplicated() function. This seemed to work, but I still wasn't sure which rows to remove or to add, and I got different results depending on whether I specified fromLast=T or fromLast=F.
An approach using bind_rows
library(dplyr)
bind_rows(df1 %>% mutate(add_remove="remove"),
df2 %>% mutate(add_remove="add")) %>%
group_by(Vessel_ID) %>%
filter(n() == 1) %>%
ungroup()
# A tibble: 5 × 4
Name Vessel_ID special_NO add_remove
<chr> <chr> <dbl> <chr>
1 Vessel2 2 20 remove
2 Vessel4 4 40 remove
3 x 6 NA add
4 y 7 NA add
5 Vessel6 10 NA add
Thanks for the comment! That looks like it would work too. Here's another solution a friend gave me, using mostly base R (the join uses dplyr, with a base merge() alternative in the comments):
df1$old_new <- "old"
df2$old_new <- "new"
#' Use the full_join function in the dplyr package to join both data.frames based on Name and Vessel_ID
df.comb <- dplyr::full_join(df1, df2, by = c("Name", "Vessel_ID"))
#' If you want to go fully base, you can use the merge() function to get the same result.
# df.comb <- merge(df1, df2, by = c("Name", "Vessel_ID"), all = TRUE, sort = FALSE)
#' Create a new column that sets the 'status' of a row
#' If old_new.x is NA, that row came from df2, so it is "new"
df.comb$status[is.na(df.comb$old_new.x)] <- "new"
# If old_new.x is not NA and old_new.y is NA then that row was in df1, but isn't in df2, so it has been "deleted"
df.comb$status[!is.na(df.comb$old_new.x) & is.na(df.comb$old_new.y)] <- "deleted"
# If old_new.x is not NA and old_new.y is not NA then that row was in both df1 and df2 = "same"
df.comb$status[!is.na(df.comb$old_new.x) & !is.na(df.comb$old_new.y)] <- "same"
# only keep the columns you need; special_NO exists in both inputs, so the join suffixes it (.x is the value from df1)
df.comb <- df.comb[, c("Name", "Vessel_ID", "special_NO.x", "status")]
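Since the question states that Vessel_ID is the only unique identifier, the add/remove table can also be sketched in base R by comparing IDs directly with %in%. This is an alternative to the join shown here, not the friend's method. The names removed, added, and df2_filled are hypothetical. The last step is a guess at the side request about copying special_NO from df1 into df2 where the IDs match.

```r
df1 <- data.frame(Name = c('Vessel1', 'Vessel2', 'Vessel3', 'Vessel4', 'Vessel5'),
                  Vessel_ID = c('1', '2', '3', '4', '5'),
                  special_NO = c(10, 20, 30, 40, 50),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Name = c('Vessel1', 'x', 'y', 'Vessel3', 'x', 'Vessel6'),
                  Vessel_ID = c('1', '6', '7', '3', '5', '10'),
                  special_NO = NA,
                  stringsAsFactors = FALSE)

# Rows whose ID disappeared from df2, and rows whose ID is new in df2
removed <- df1[!df1$Vessel_ID %in% df2$Vessel_ID, ]
added   <- df2[!df2$Vessel_ID %in% df1$Vessel_ID, ]
removed$add_remove <- "remove"
added$add_remove   <- "add"
df3 <- rbind(removed, added)
print(df3)

# Side request: fill special_NO in df2 from df1 where Vessel_ID matches
df2_filled <- df2
df2_filled$special_NO <- df1$special_NO[match(df2$Vessel_ID, df1$Vessel_ID)]
```

Comparing on Vessel_ID alone means a vessel that was only renamed (ID 5 here) counts as unchanged, whereas a join on both Name and Vessel_ID would report it as one deletion plus one addition.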
I have a dataframe, dat:
dat<-data.frame(col1=rep(1:4,3),
col2=rep(letters[24:26],4),
col3=letters[1:12])
I want to filter dat on two different columns using ONLY the combinations given by the rows in the data frame filter:
filter<-data.frame(col1=1:3,col2=NA)
lists<-list(list("x","y"),list("y","z"),list("x","z"))
filter$col2<-lists
So for example, rows containing (1,x) and (1,y), would be selected, but not (1,z),(2,x), or (3,y).
I know how I would do it using a for loop:
#create a frame to drop results in
results<-dat[0,]
for(f in 1:nrow(filter)){
temp_filter<-filter[f,]
temp_dat<-dat[dat$col1==temp_filter[1,1] &
dat$col2%in%unlist(temp_filter[1,2]),]
results<-rbind(results,temp_dat)
}
Or if you prefer dplyr style:
require(dplyr)
results<-dat[0,]
for(f in 1:nrow(filter)){
temp_filter<-filter[f,]
temp_dat<-filter(dat,col1==temp_filter[1,1] &
col2%in%unlist(temp_filter[1,2]))
results<-rbind(results,temp_dat)
}
results should return
col1 col2 col3
1 1 x a
5 1 y e
2 2 y b
6 2 z f
3 3 z c
7 3 x g
I would normally do the filtering using a merge, but I can't now since I have to check col2 against a list rather than a single value. The for loop works but I figured there would be a more efficient way to do this, probably using some variation of apply or do.call.
A solution using tidyverse; dat2 is the final output. The idea is to extract the values from the list column of the filter data frame and reshape it into filter2, a long data frame whose col1 and col2 columns hold the same kind of values as those in dat. Finally, semi_join filters dat against filter2 to create dat2.
By the way, filter is a function in the dplyr package. Since your example uses dplyr, it is better to avoid naming a data frame filter.
library(tidyverse)
filter2 <- filter %>%
mutate(col2_a = map_chr(col2, 1),
col2_b = map_chr(col2, 2)) %>%
select(-col2) %>%
gather(group, col2, -col1)
dat2 <- dat %>%
semi_join(filter2, by = c("col1", "col2")) %>%
arrange(col1)
dat2
col1 col2 col3
1 1 x a
2 1 y e
3 2 y b
4 2 z f
5 3 z c
6 3 x g
Update
Another way to prepare the filter2 data frame, which does not need to know how many elements are in each list. The rest is the same as the previous solution.
library(tidyverse)
filter2 <- filter %>%
rowwise() %>%
do(tibble(col1 = .$col1, col2 = flatten_chr(.$col2)))
dat2 <- dat %>%
semi_join(filter2, by = c("col1", "col2")) %>%
arrange(col1)
This is doable with a straightforward join once you get the filter list back to a standard data.frame:
merge(
dat,
with(filter, data.frame(col1=rep(col1, lengths(col2)), col2=unlist(col2)))
)
# col1 col2 col3
#1 1 x a
#2 1 y e
#3 2 y b
#4 2 z f
#5 3 x g
#6 3 z c
Arguably, I'd do away with whatever process is creating those nested lists in the first place.
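To make the expansion step concrete, here is a minimal sketch that builds the flat lookup table and then filters with %in% on pasted keys instead of merge(). The paste-based filter is a swapped-in technique, chosen so the original row order and row names of dat are kept. The names flt, lookup, and keep are hypothetical; flt replaces the name filter to avoid clashing with the function of that name.

```r
dat <- data.frame(col1 = rep(1:4, 3),
                  col2 = rep(letters[24:26], 4),
                  col3 = letters[1:12],
                  stringsAsFactors = FALSE)

# The nested-list filter from the question, under a non-clashing name
flt <- data.frame(col1 = 1:3)
flt$col2 <- list(list("x", "y"), list("y", "z"), list("x", "z"))

# One row per (col1, col2) combination: repeat col1 once per list element
lookup <- data.frame(col1 = rep(flt$col1, lengths(flt$col2)),
                     col2 = unlist(flt$col2),
                     stringsAsFactors = FALSE)

# Keep the rows of dat whose (col1, col2) pair appears in the lookup table
keep <- paste(dat$col1, dat$col2) %in% paste(lookup$col1, lookup$col2)
res <- dat[keep, ]
print(res)
```

Because lengths() is computed per row, this works even when the inner lists have different numbers of elements.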
I have one dataframe (df1) with more than 200 columns containing data (several thousands of rows each). Column names are alphanumeric and all distinct from each other.
I have a second dataset (df2) with a couple of columns where the first column (named 'col1') contains rows with "values" carrying colnames of df1.
However, not every row in df2 has a corresponding column in df1.
Now I would like to delete (drop) all rows in df2 where there is no "corresponding" column in df1.
I searched quite a while using keywords like "subset data.frame by values from another data.frame" but did not find any solution. I checked, e.g. here, here or here and some other places.
Thanks for your help.
Data:
df1 <- data.frame(a = 1:3, b = 1:3)
# a b
# 1 1 1
# 2 2 2
# 3 3 3
df2 <- data.frame(col1 = c("a", "c"))
# col1
# 1 a
# 2 c
Keep rows in df2 whose values are names in df1:
subset(df2, col1 %in% names(df1))
# col1
# 1 a
I have a data frame, df2, containing observations grouped by a ID factor that I would like to subset. I have used another function to identify which rows within each factor group that I want to select. This is shown below in df:
df <- data.frame(ID = c("A","B","C"),
pos = c(1,3,2))
df2 <- data.frame(ID = c(rep("A",5), rep("B",5), rep("C",5)),
obs = c(1:15))
In df, pos is the index of the row that I want to select within the factor level given in ID, not within the whole dataframe df2. I'm looking for a way to select the rows for each ID according to that index (that is, their row number within each level of ID in df2).
So, in this example, I want to select the first value in df2 with ID == 'A', the third value in df2 with ID == 'B' and the second value in df2 with ID == 'C'.
This would then give me:
df3 <- data.frame(ID = c("A", "B", "C"),
obs = c(1, 8, 12))
dplyr
library(dplyr)
merge(df,df2) %>%
group_by(ID) %>%
filter(row_number() == pos) %>%
select(-pos)
# ID obs
# 1 A 1
# 2 B 8
# 3 C 12
base R
df2m <- merge(df,df2)
do.call(rbind,
by(df2m, df2m$ID, function(SD) SD[SD$pos[1], setdiff(names(SD),"pos")])
)
by splits the merged data frame df2m by df2m$ID and operates on each part; it returns results in a list, so they must be rbinded together at the end. Each subset of the data (associated with each value of ID) is filtered by pos and deselects the "pos" column using normal data.frame syntax.
data.table, suggested by @DavidArenburg in a comment
library(data.table)
setkey(setDT(df2),"ID")[df][,
.SD[pos[1L], !"pos", with=FALSE]
, by = ID]
The first part -- setkey(setDT(df2),"ID")[df] -- is the merge. After that, the resulting table is split by = ID, and each Subset of Data, .SD is operated on. pos[1L] is subsetting in the normal way, while !"pos", with=FALSE corresponds to dropping the pos column.
See @eddi's answer for a better data.table approach.
Here's the base R solution:
df2$pos <- ave(df2$obs, df2$ID, FUN=seq_along)
merge(df, df2)
ID pos obs
1 A 1 1
2 B 3 8
3 C 2 12
If df2 is sorted by ID, you can just do df2$pos <- sequence(table(df2$ID)) for the first line.
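As a small self-contained check, the sketch below builds the within-group position both ways on the question's data and confirms they agree when df2 is sorted by ID. The names pos_ave, pos_seq, and res are hypothetical, and the position is computed from row numbers rather than from obs so that it does not depend on the type of obs.

```r
df <- data.frame(ID = c("A", "B", "C"),
                 pos = c(1, 3, 2),
                 stringsAsFactors = FALSE)
df2 <- data.frame(ID = c(rep("A", 5), rep("B", 5), rep("C", 5)),
                  obs = 1:15,
                  stringsAsFactors = FALSE)

# Position of each row within its ID group
pos_ave <- ave(seq_len(nrow(df2)), df2$ID, FUN = seq_along)

# Shortcut that is only valid when df2 is already sorted by ID
pos_seq <- as.integer(sequence(table(df2$ID)))

df2$pos <- pos_ave

# Joining on both ID and pos keeps exactly the requested rows
res <- merge(df, df2)
print(res)
```

The ave() version is the safer default, since it gives the right positions even when the rows of a group are not contiguous.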
Using data.table version 1.9.5+:
setDT(df2)[df, .SD[pos], by = .EACHI, on = 'ID']
which merges on ID column, then selects the pos row for each of the rows of df.