I want to combine two dataframes: df1 with 15,000 obs and df2 with 2.3 million. I'm trying to match values: if df1$col1 == df2$c1 AND df1$col2 == df2$c2, then insert the value from df2$dummy into df1$col3. If there is no match on both, do nothing. All values are 8 digits, except df2$dummy, which is a dummy of 0 or 1.
df1 col1 col2 col3
1 25382701 65352617 -
2 22363658 45363783 -
3 20019696 23274747 -
df2 c1 c2 dummy
1 17472802 65548585 1
2 20383829 24747473 0
3 20019696 23274747 0
4 01382947 21930283 1
5 22123425 65382920 0
In the example the only match is row 3, and the value 0 from the dummy column should be inserted into col3 of row 3.
I've tried making a look-up table and a function with for and if, but I haven't found a solution that requires matches across two dataframes. (No need to say it, I guess, but I'm new to R and programming..)
We can use a join in data.table
library(data.table)
df1$col3 <- NULL  # remove the placeholder "-" column so the join can create col3 fresh
setDT(df1)[df2, col3 := i.dummy, on = .(col1 = c1, col2 = c2)]
df1
# col1 col2 col3
#1: 25382701 65352617 NA
#2: 22363658 45363783 NA
#3: 20019696 23274747 0
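For comparison, here is a base R sketch of the same two-key lookup using match() on pasted keys; it is not part of the answer above, just an alternative that avoids data.table and assumes the df1/df2 shown in the data section below:
# build one composite key per row and look up the dummy value
key1 <- paste(df1$col1, df1$col2)
key2 <- paste(df2$c1, df2$c2)
df1$col3 <- df2$dummy[match(key1, key2)]
# rows with no match on both columns stay NA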
data
df1 <- structure(list(col1 = c(25382701L, 22363658L, 20019696L), col2 = c(65352617L,
45363783L, 23274747L), col3 = c("-", "-", "-")), class = "data.frame", row.names = c("1",
"2", "3"))
df2 <- structure(list(c1 = c(17472802L, 20383829L, 20019696L, 1382947L,
22123425L), c2 = c(65548585L, 24747473L, 23274747L, 21930283L,
65382920L), dummy = c(1L, 0L, 0L, 1L, 0L)), class = "data.frame",
row.names = c("1",
"2", "3", "4", "5"))
Given a data frame in R, how do I determine the number of non-blank values per row?
col1 col2 col3 rowCounts
   1    3              2
        1    6         2
             1         1
                       0
This is how I did it in Python:
df['rowCounts'] = df.apply(lambda x: x.count(), axis=1)
What is the R Code for this?
In base R, we can use rowSums as a vectorized option (assuming NA marks a blank): it adds up the logical matrix !is.na(df), where each TRUE (i.e. a non-NA value) counts as 1 for its row.
df$rowCounts <- rowSums(!is.na(df))
-output
df
# col1 col2 col3 rowCounts
#1 1 3 NA 2
#2 NA 1 6 2
#3 NA NA 1 1
#4 NA NA NA 0
If the blank is ""
df$rowCounts <- rowSums(df != "", na.rm = TRUE)
Or with apply and MARGIN = 1, which has a similar syntax to the Python version (though it will be slower compared to rowSums):
df$rowCounts <- apply(df, 1, function(x) sum(!is.na(x)))
data
df <- structure(list(col1 = c(1L, NA, NA, NA), col2 = c(3L, 1L, NA,
NA), col3 = c(NA, 6L, 1L, NA)), class = "data.frame", row.names = c(NA,
-4L))
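If a dplyr pipeline is preferred, the same rowSums idea can be wrapped in mutate. This is a sketch, not part of the answer above; it selects the original columns explicitly and assumes dplyr >= 1.1 for pick():
library(dplyr)
# count the non-NA cells of col1:col3, row by row
df <- df %>% mutate(rowCounts = rowSums(!is.na(pick(col1:col3))))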
I have a data frame with some errors:
T item V1 V2
1 a 2 .1
2 a 5 .8
1 b 1 .7
2 b 2 .2
I have another data frame with corrections for items concerning V1 only
T item V1
1 a 2
2 a 6
How do I get the final data frame? Should I use merge or rbind? Note: the actual data frames are big.
An option would be a data.table join on 'T' and 'item', assigning 'V1' the corresponding 'V1' column (i.V1) from the second dataset:
library(data.table)
setDT(df1)[df2, V1 := i.V1, on = .(T, item)]
df1
# T item V1 V2
#1: 1 a 2 0.1
#2: 2 a 6 0.8
#3: 1 b 1 0.7
#4: 2 b 2 0.2
data
df1 <- structure(list(T = c(1L, 2L, 1L, 2L), item = c("a", "a", "b",
"b"), V1 = c(2L, 5L, 1L, 2L), V2 = c(0.1, 0.8, 0.7, 0.2)),
class = "data.frame", row.names = c(NA, -4L))
df2 <- structure(list(T = 1:2, item = c("a", "a"), V1 = c(2L, 6L)),
class = "data.frame", row.names = c(NA,
-2L))
This should work -
library(dplyr)
df1 %>%
  left_join(df2, by = c("T", "item")) %>%
  mutate(
    V1 = coalesce(as.numeric(V1.y), as.numeric(V1.x))
  ) %>%
  select(-V1.x, -V1.y)
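For reference, a base R sketch of the same correction using match() on both keys; this is not taken from the answers above and assumes the df1/df2 shown in the data section:
# index of each (T, item) pair of df1 in df2, NA where there is no correction
idx <- match(paste(df1$T, df1$item), paste(df2$T, df2$item))
df1$V1 <- ifelse(is.na(idx), df1$V1, df2$V1[idx])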
I have two data frames.
df1
col1
1 a
2 b
3 c
4 c
df2
setID col1
1 1 a
2 1 b
3 1 b
4 1 a
5 2 w
6 2 v
7 2 c
8 2 b
9 3 a
10 3 a
11 3 b
12 3 a
13 4 a
14 4 b
15 4 c
16 4 a
I'm using the following code to match them.
scorematch <- function() {
  require("dplyr")
  # to make sure every element is preceded by the one before that element
  combm <- rev(sapply(rev(seq_along(df1$col1)), function(i) paste0(df1$col1[i - 1], df1$col1[i])))
  tempdf <- df2
  # group the history by their ID
  tempdf <- group_by(tempdf, setID)
  # collapse strings in history
  tempdf <- summarise(tempdf, ss = paste(col1, collapse = ""))
  tempdf <- rowwise(tempdf)
  # add score based on how it matches compared to path
  tempdf <- mutate(tempdf, score = sum(sapply(combm, function(x) sum(grepl(x, ss)))))
  tempdf <- ungroup(tempdf)
  # filter so that only IDs with scores more than 0 are available
  tempdf <- filter(tempdf, score != 0)
  tempdf <- pull(tempdf, setID)
  # filter original history to reflect new history
  tempdf2 <- filter(df2, setID %in% tempdf)
  tempdf2
}
This code works great. But I want to take this further. I want to apply a sliding window function to get the df1 values I want to match against df2. So far I'm using this function as my sliding window.
slidingwindow <- function(data, window, step) {
  # data is a dataframe with colname
  total <- length(data)
  # spots are the start of each window
  spots <- seq(from = 1, to = (total - step), by = step)
  result <- vector(length = length(spots))
  for (i in 1:length(spots)) {
    ...
  }
  return(result)
}
The scorematch function will be nested inside the slidingwindow function. I'm unsure how to proceed from there, though. Ideally df1 will be split into windows. Starting from the first window, it will be matched against df2 using the scorematch function to get a filtered df2. Then I want the second window of df1 to match against the newly filtered df2, and so on. The loop should end when df2 has been filtered down so that it contains only 1 distinct setID value. The final output can be either the whole filtered df2 or just the remaining setID.
Ideal output would be either
setID col1
1 4 a
2 4 b
3 4 c
4 4 a
or
[1] "4"
Here is a solution without using a for-loop. I use stringr because of its nice consistent syntax, purrr for map (although lapply would be sufficient in this case) and dplyr to group_by setID and collapse the strings for each group.
library(dplyr)
library(purrr)
library(stringr)
First I collapse the string for each group. This makes it easier to use pattern matching with str_detect later:
df2_collapse <- df2 %>%
group_by(setID) %>%
summarise(string = str_c(col1, collapse = ""))
df2_collapse
# A tibble: 4 x 2
# setID string
# <int> <chr>
# 1 1 abba
# 2 2 wvcb
# 3 3 aaba
# 4 4 abca
The "look-up" string is collapsed as well, and then the substrings (i.e. sliding windows) are extracted with str_sub. Here I work along the length of the string (str_length) and extract all possible groups following each letter in the string.
string <- str_c(df1$col1, collapse = "")
string
# [1] "abcc"
substrings <-
unlist(map(1:str_length(string), ~ str_sub(string, start = .x, end = .x:str_length(string))))
Store the substrings in a tibble with their length as score.
substrings
# [1] "a" "ab" "abc" "abcc" "b" "bc" "bcc" "c" "cc" "c"
substrings <- tibble(substring = substrings,
score = str_length(substrings))
substrings
# A tibble: 10 x 2
# substring score
# <chr> <int>
# 1 a 1
# 2 ab 2
# 3 abc 3
# 4 abcc 4
# 5 b 1
# 6 bc 2
# 7 bcc 3
# 8 c 1
# 9 cc 2
# 10 c 1
For each setID we extract the maximum score it matches in the substring data and then keep the row with the maximum score across all setIDs.
df2_collapse %>%
mutate(score = map_dbl(string,
~ max(substrings$score[str_detect(.x, substrings$substring)]))) %>%
filter(score == max(score))
# A tibble: 1 x 3
# setID string score
# <int> <chr> <dbl>
# 1 4 abca 3
Data
df1 <- structure(list(col1 = c("a", "b", "c", "c")),
class = "data.frame", row.names = c("1", "2", "3", "4"))
df2 <-
structure(list(setID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L),
col1 = c("a", "b", "b", "a", "w", "v", "c", "b", "a", "a", "b", "a", "a", "b", "c", "a")),
class = "data.frame",
row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16"))
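For completeness, here is a rough sketch of the iterative windowed filtering the question describes. It assumes scorematch is rewritten to take df1 and df2 as arguments and return the filtered df2; the name slidingmatch and the window/step defaults are assumptions, not part of the answer above:
slidingmatch <- function(df1, df2, window = 2, step = 1) {
  starts <- seq(1, nrow(df1) - window + 1, by = step)
  for (s in starts) {
    win <- df1[s:(s + window - 1), , drop = FALSE]
    # score the current window against the (already filtered) df2
    df2 <- scorematch(win, df2)
    # stop once only one distinct setID is left
    if (length(unique(df2$setID)) <= 1) break
  }
  df2
}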
Say I have this data frame in R.
df <- data.frame( col1 = c(3,4,'NA','NA'), col2 = c('NA','NA',1,5))
col1 col2
1 3 NA
2 4 NA
3 NA 1
4 NA 5
I would like to have new column like this
col1 col2 col3
1 3 NA 3
2 4 NA 4
3 NA 1 1
4 NA 5 5
How shall I do that?
At the moment your df does not contain true NA values but rather the strings 'NA'. You probably want true NA, as per @G5W's comment.
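A minimal sketch of that conversion for the df shown; the "NA" strings (whether factor or character) become real numeric NA, with a coercion warning:
df[] <- lapply(df, function(x) as.numeric(as.character(x)))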
Once we have true NA we can use:
df$col3 <- ifelse(is.na(df$col1), df$col2, df$col1)
or, with dplyr:
library(dplyr)
df$col3 <- coalesce(df$col1, df$col2)
We can use pmax or pmin to do this (from base R); it works here because each row has only one non-NA value:
df$col3 <- do.call(pmax, c(df, na.rm=TRUE))
df$col3
#[1] 3 4 1 5
data
df <- structure(list(col1 = c(3L, 4L, NA, NA), col2 = c(NA, NA, 1L,
5L)), .Names = c("col1", "col2"), class = "data.frame", row.names = c("1",
"2", "3", "4"))
I'm trying to rename multiple column names of a data frame in which the columns contain more than a single type; the columns are of class factor.
col1 col2 col3 col4 col5 col6
a b c a b a
1 5 8 2 2 5
I'm renaming conditional on an entry in the first row:
colnames(df)[which(df[1,]=="b " )]<-"new_colname"
Ideally producing something like:
col1 new_colname col3 col4 new_colname.2 col6
a b c a b a
1 5 8 2 2 5
But when I do this all the columns that are renamed have their data replaced with NAs, producing:
col1 col2 col3
NA NA NA
NA NA NA
Does anyone know why this would happen?
Suppose the dataset columns are all of "factor" class; first convert the columns to "character" class:
df[] <- lapply(df, as.character)
In case there are leading/trailing spaces, use str_trim to remove them:
library(stringr)
df[] <- lapply(df, str_trim)
Change the column names based on the conditions mentioned, and use make.names to create unique names for the duplicated column names:
names(df)[df[1,]=='b'] <- 'new_colname'
names(df) <- make.names(names(df), unique=TRUE)
df
# col1 new_colname col3 col4 new_colname.1 col6
#1 a b c a b a
#2 1 5 8 2 2 5
data
df <- structure(list(col1 = structure(c(2L, 1L), .Label = c("1", "a"
), class = "factor"), col2 = structure(c(2L, 1L), .Label = c("5",
"b"), class = "factor"), col3 = structure(c(2L, 1L), .Label = c("8",
"c"), class = "factor"), col4 = structure(c(2L, 1L), .Label = c("2",
"a"), class = "factor"), col5 = structure(c(2L, 1L), .Label = c("2",
"b"), class = "factor"), col6 = structure(c(2L, 1L), .Label = c("5",
"a"), class = "factor")), .Names = c("col1", "col2", "col3",
"col4", "col5", "col6"), row.names = c(NA, -2L), class = "data.frame")
In the end I fixed this by naming them with a different method, using a for loop:
for (i in 1:length(df)) {
  colnames(df)[i] <- paste("df", df[1, i], df[3, i], i, sep = "_")
}
This would probably not be feasible for an extremely large dataset, so if anyone knows how one might do this another way, please post an answer.
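One possible vectorized alternative, sketched under the same assumptions as the loop above (in particular that df has at least three rows, as df[3, i] implies):
# paste() is vectorized, so all names can be built in one call
vals1 <- vapply(df, function(x) as.character(x[1]), character(1))
vals3 <- vapply(df, function(x) as.character(x[3]), character(1))
colnames(df) <- paste("df", vals1, vals3, seq_along(df), sep = "_")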