I have a dataframe that I consider history. A small-scale sample is below.
historydf
ID col2
1 1 a
2 1 c
3 1 c
4 1 e
5 2 a
6 2 b
7 2 b
8 2 e
9 3 a
10 3 a
11 3 b
12 3 c
13 4 b
14 4 a
15 4 a
16 4 c
testdf
col1
1 a
2 a
3 b
4 e
I want to know if I can match testdf against historydf to find an exact or the closest match, and output the ID(s).
I have a few conditions that must be met.
The test df cannot be matched against the history in full. It can be matched element by element, or even more than one element at a time, but never the whole df at once.
The sequence of the match is important, to get as close as possible to the history.
A sample output step by step is below. In my case I am considering element by element due to the small scale of the sample.
testdf$col1[1] is "a". There are no prior elements so this is the start of the sequence. Since "a" appears in all IDs the output will be: ID = 1,2,3,4
testdf$col1[2] is "a". The prior element was "a". Now we look for "a" preceded by "a". Since "a","a" appears in 2 IDs the output will be: ID = 3,4
testdf$col1[3] is "b". The prior element was "a". Now we look for "b" preceded by "a". Since "a","b" appears in 1 ID the output will be: ID = 3
Now that only one ID remains, the matching can stop, and the final output is that ID 3 is the closest match to testdf.
It is important to note that the search parameter in history can be narrowed down with each successful match. For example during the second match in the above example the history can be narrowed down to only ID 3 and 4.
Hopefully the question is clear, and I would appreciate any help as long as it follows the two conditions I mentioned.
Interesting problem. I would approach it by defining a similarity score; please take a look at the details and code below, and perhaps do some testing to see whether this is consistent with your expected output for your actual data.
The solution involves the following approach:
Form the combinations combn of successive col1 elements from testdf:
combn <- rev(sapply(rev(seq_along(testdf$col1))[-1],
                    function(i) paste0(testdf$col1[i-1], testdf$col1[i])));
combn;
#[1] "a" "aa" "ab"
You can see that this corresponds to the three cases you test for in your example: (1) the first "a", (2) "a" preceded by "a", and (3) "b" preceded by "a".
We now group entries in historydf by ID and summarise entries in col2 by pasting characters together in column entries ss (one per ID). We then calculate a row-wise score by summing the number of matches of every combn element with ss; the larger the score the more matching successive entries from col1 are present in historydf$col2. We then simply extract the row with the largest score, and pull the corresponding ID (ID = 3 in your case).
library(dplyr)

historydf %>%
    group_by(ID) %>%
    summarise(ss = paste(col2, collapse = "")) %>%
    rowwise() %>%
    mutate(score = sum(sapply(combn, function(x) sum(grepl(x, ss))))) %>%
    ungroup() %>%
    filter(score == max(score)) %>%
    pull(ID)
#[1] 3
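The step-by-step narrowing described in the question can also be written directly as an iterative filter. A base-R sketch, assuming the sample data shown above (the data-frame definitions here are reconstructed from the question's printout):

```r
historydf <- data.frame(ID = rep(1:4, each = 4),
                        col2 = c("a","c","c","e",  "a","b","b","e",
                                 "a","a","b","c",  "b","a","a","c"))
testdf <- data.frame(col1 = c("a","a","b","e"))

candidates <- unique(historydf$ID)
for (i in seq_along(testdf$col1)) {
  # match a single element on the first step, a pair of successive elements after
  pat <- paste0(testdf$col1[max(1, i - 1):i], collapse = "")
  keep <- vapply(candidates, function(id) {
    grepl(pat, paste(historydf$col2[historydf$ID == id], collapse = ""),
          fixed = TRUE)
  }, logical(1))
  if (!any(keep)) break           # no candidate matches; stop narrowing
  candidates <- candidates[keep]  # narrow the search space to matching IDs
  if (length(candidates) == 1) break
}
candidates
# [1] 3
```

This follows the question's procedure literally: each successful match shrinks the candidate set, and the loop stops as soon as a single ID remains.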
I have a dataframe returned from a function that looks like this:
df <- data.frame(data = c(1,2,3,4,5,6,7,8))
rownames(df) <- c('firsta','firstb','firstc','firstd','seconda','secondb','secondc','secondd')
        data
firsta     1
seconda    5
firstb     2
secondb    6
my goal is to turn it into this:
df_goal <- data.frame(first = c(1,2,3,4), second = c(5,6,7,8))
rownames(df_goal) <- c('a','b','c','d')
first second
a 1 5
b 2 6
Basically the problem is that there is information in the row names that I can't discard because there isn't otherwise a way to distinguish between the column values.
This is a simple long-to-wide conversion; the twist is that we need to generate the key variable from the rownames by splitting the string appropriately.
In the data you present, the rowname consists of the concatenation of a "position" (i.e. 'first', 'second') and an id (i.e. 'a', 'b') stuck at the end. This structure makes splitting complicated: ideally, you'd use a separator (i.e. first_a, first_b) to make the separation unambiguous. Without a separator, our only option is to split on position, but that requires the splitting position to be a fixed distance from the start or end of the string.
In your example, the id is always a single character at the end, so we can pass -1 to the sep argument of separate to split off the last character as the id column. If that weren't always true, you would need to come up with a more complex solution to resolve the rownames.
Once you have converted the rownames into a "position" and "id" column, it's a simple matter to use spread to spread the position column into the wide format:
library(tidyverse)
df %>%
    rownames_to_column('row') %>%
    separate(row, into = c('num', 'id'), sep = -1) %>%
    spread(num, data)
id first second
1 a 1 5
2 b 2 6
3 c 3 7
4 d 4 8
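Note that in current tidyr (>= 1.0), spread is superseded by pivot_wider; a sketch of the equivalent pipeline, with the example data rebuilt inline so it runs standalone:

```r
library(tidyverse)

df <- data.frame(data = 1:8)
rownames(df) <- c('firsta','firstb','firstc','firstd',
                  'seconda','secondb','secondc','secondd')

df %>%
    rownames_to_column('row') %>%
    separate(row, into = c('num', 'id'), sep = -1) %>%  # split off last character
    pivot_wider(names_from = num, values_from = data)
```

The result is the same wide tibble with columns id, first, and second.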
If row ids could be of variable length, the above solution wouldn't work. If you have a known and limited number of "position" values, you could use a regex solution to split the rowname:
Here, we extract the position value by matching to a regex containing all possible values (| is the OR operator).
We match the "id" value by putting that same regex in a positive lookbehind operator (`(?<=...)`). This regex matches one or more lowercase letters that come immediately after a match to the position value. The downside of this approach is that you need to specify all possible values of "position" in the regex; if there are many options, this could quickly become too long and difficult to maintain:
df2
data
firsta 1
firstb 2
firstc 3
firstd 4
seconda 5
secondb 6
secondc 7
secondd 8
secondee 9
df2 %>%
    rownames_to_column('row') %>%
    mutate(num = str_extract(row, 'first|second'),
           id = str_match(row, '(?<=first|second)[a-z]+')) %>%
    select(-row) %>%
    spread(num, data)
id first second
1 a 1 5
2 b 2 6
3 c 3 7
4 d 4 8
5 ee NA 9
Edit
This question seems to be a duplicate of the question How to group a vector into a list of vectors?, and the answer split(df$b, df$id) was suggested. Though initially happy with that solution, I realized that the given answers do not fully address my question. In the question below, I would like to obtain a list in which the vector elements are named with the value of a third column (in my example df$a). This is important, as otherwise the order of df$b plays a role. Obviously I can arrange by df$a and then call split(), but maybe there is another way of doing it.
My sample df:
df <- data_frame(id = paste0('id',rep(1:2, each = 5)), a = rep(letters[1:5],2),b=c(1:5,5:1))
The df should be grouped by id (df$id). I would like to create, for each group (id), a vector that contains the values of df$b. My approach:
require(tidyr)
list_group_elements <- list()  # initialize the output list
spread_df <- df %>% spread(id, b)  # makes new columns for each id
# loop over spread_df
for (i in 1:length(spread_df)) {
    list_group_elements[i] <- list(spread_df[[i]])
    # I want each vector to be identified by the identifier of column df$a,
    # therefore:
    names(list_group_elements[[i]]) <- list_group_elements[[1]]
}
This results in :
list_group_elements
[[1]]
a b c d e
"a" "b" "c" "d" "e"
[[2]]
a b c d e
1 2 3 4 5
[[3]]
a b c d e
5 4 3 2 1
I don't need the first element of the list, but the rest is basically what I need. I have the impression that my approach is not ideal, and if someone has an idea to improve it (e.g., with dplyr?), this would be highly appreciated. Why do I want this: I made a function that takes vectors as arguments, and I would like to run it over certain columns of dataframes, using only the grouped values as arguments rather than the entire column.
You may make df$b a named vector using setNames, and then split it into a list:
split(setNames(df$b, df$a), df$id)
# $id1
# a b c d e
# 1 2 3 4 5
#
# $id2
# a b c d e
# 5 4 3 2 1
One way is
lapply(unique(df$id), function(L) df$b[df$id == L])
(using unique(df$id) here, since id is character rather than a factor, so levels(df$id) would return NULL)
[[1]]
[1] 1 2 3 4 5
[[2]]
[1] 5 4 3 2 1
Consider by, the object-oriented wrapper of tapply, designed to split a dataframe by factor(s):
by(df, df$id, FUN=function(i) i$b)
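Since the question asks about dplyr: a tidyverse-flavoured sketch of the same idea, collapsing each group's b into a named vector held in a list-column, then turning the two-column result into a named list with tibble::deframe (tibble ships with dplyr):

```r
library(dplyr)

df <- data.frame(id = paste0('id', rep(1:2, each = 5)),
                 a  = rep(letters[1:5], 2),
                 b  = c(1:5, 5:1))

out <- df %>%
    group_by(id) %>%
    summarise(v = list(setNames(b, a))) %>%  # one named vector per id
    tibble::deframe()                        # names from id, values from v
out
# $id1
# a b c d e
# 1 2 3 4 5
#
# $id2
# a b c d e
# 5 4 3 2 1
```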
Ultimately, I need to search column 1 of my data frame and, if I find a match to var, get the element next to it in the same row (the next column over).
I thought I could use match or %in% to find the index in a vector I got from the data frame, but I keep getting NA.
One example I looked at is
Is there an R function for finding the index of an element in a vector?, and I don't understand why my code is any different from the answers.
So in my code below, if I find "b" in column 1, I eventually want to get 20 from the data frame.
What am I doing wrong? I would prefer to stick to base R if possible.
> df = data.frame(names = c("a","b","c"),weight = c(10,20,30),height=c(5,10,15))
> df
names weight height
1 a 10 5
2 b 20 10
3 c 30 15
> vect = df[1]
> vect
names
1 a
2 b
3 c
> match("b", vect)
[1] NA
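The thread stops here, but the likely culprit is that df[1] is still a one-column data frame, not a vector; match() then coerces the list and returns NA. A minimal base-R sketch of the fix:

```r
df <- data.frame(names = c("a","b","c"),
                 weight = c(10,20,30), height = c(5,10,15))

# df[1] is a data.frame; df[[1]] (or df$names) is the underlying vector
i <- match("b", df[[1]])
df$weight[i]
# [1] 20

# or in one step, without match():
df$weight[df$names == "b"]
# [1] 20
```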
Now I have a df that looks like the below:
v1 v2 v3
1 2 3
4 5 6
What should I do with the rownames so that, if v2 of the row where rownames(df) %% 2 == 0 does not equal v2 of the row where rownames(df) %% 2 == 1, both rows are deleted?
Thank you all.
Update:
For this df below, you can see that for row 1 and 2, they have the same ID, so I want to keep these two rows as a pair (CODE shows 1 and 4).
Similarly I want to keep row 10 and 11 because they have the same ID and they are a pair.
What should I do to get a new df?
1) Create a dataframe with a column for the number of times each id occurs:
library(sqldf)
df2 = sqldf("select count(id) as count, id from df group by id")
2) Merge them:
df3 = merge(df, df2)
3) Select only the rows where count > 1:
df3[df3$count > 1, ]
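The same count-and-filter idea can be done in base R without sqldf; a minimal sketch on a toy df, assuming the relevant column is named id:

```r
df <- data.frame(id = c(263733, 263733, 2913733, 3243733, 3583733, 3583733),
                 code = 1:6)

# keep only rows whose id occurs more than once (i.e. paired ids)
df[df$id %in% df$id[duplicated(df$id)], ]
```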
If what you are looking for is to keep paired IDs and delete the rest (I doubt it is as simple as this), then:
Extract your ids (I have written them out here; you would extract them from your df):
id = c(263733,263733,2913733,3243733,3723733,4493733,273733,393733,2953733,3583733,3583733)
Sort them, then find out which ones to keep:
id = sort(id)
id1 = cbind(id[-length(id)], id[-1])
chosenID = id1[which(id1[,1] == id1[,2]), 1]
And then extract from your df those rows that have chosenID.
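Putting those steps together on a toy df (assuming the id column is named id); the last line does the extraction mentioned above:

```r
df <- data.frame(id = c(263733, 2913733, 263733, 3243733, 3583733, 3583733),
                 code = 1:6)

id <- sort(df$id)
id1 <- cbind(id[-length(id)], id[-1])     # adjacent pairs after sorting
chosenID <- id1[id1[, 1] == id1[, 2], 1]  # ids that repeat back-to-back
df[df$id %in% chosenID, ]                 # rows belonging to paired ids
```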
I need to turn these two matrices corresponding to (toy) word counts:
a hope to victory win
[1,] 2 1 1 1 1
and
a chance than win
[1,] 1 1 1 1
where the word "a" appears a combined number of 3 times, and the word "win" appears 2 times (once in each matrix), into:
a win chance hope than to victory
[1,] 3 2 1 1 1 1 1
where equally-named columns combine into a single column that contains the sum.
And,
a hope to victory win different than
[1,] 2 1 1 1 1 0 0
where first matrix is preserved, and the second matrix is attached at the end but with only unique column names and with all the row values equal to zero.
So, if you store this data in a data frame (which is really recommended for this sort of data), the process is very simple.
(I'm including a conversion from that format, which works with any number of rows.)
Conversion:
newdf1 <- data.frame(Word = colnames(matrix1), Count = as.vector(t(matrix1)))
newdf2 <- data.frame(Word = colnames(matrix2), Count = as.vector(t(matrix2)))
now you can use rbind + dplyr (or data.table)
dplyr solution:
library(dplyr)
df <- rbind(newdf1,newdf2)
result <- df %>% group_by(Word) %>% summarise(Count = sum(Count))
The answer to your second question is related:
result2 <- rbind(newdf1,data.frame(Word = setdiff(newdf2$Word,newdf1$Word), Count = 0))
(the data.table solution is very similar, but if you're new to data frames and grouping/reshaping, I recommend dplyr)
(EDITED the second solution so that it's actually giving you the unique entries)
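An end-to-end check with the toy matrices from the question (the matrix definitions are reconstructed from the printouts above; dplyr assumed installed):

```r
library(dplyr)

matrix1 <- matrix(c(2, 1, 1, 1, 1), nrow = 1,
                  dimnames = list(NULL, c("a", "hope", "to", "victory", "win")))
matrix2 <- matrix(c(1, 1, 1, 1), nrow = 1,
                  dimnames = list(NULL, c("a", "chance", "than", "win")))

# long format: one row per (Word, Count) pair
newdf1 <- data.frame(Word = colnames(matrix1), Count = as.vector(t(matrix1)))
newdf2 <- data.frame(Word = colnames(matrix2), Count = as.vector(t(matrix2)))

# first result: equally-named columns summed together
result <- rbind(newdf1, newdf2) %>%
    group_by(Word) %>%
    summarise(Count = sum(Count))
result   # a = 3, win = 2, every other word = 1

# second result: first matrix preserved, new words appended with Count 0
result2 <- rbind(newdf1,
                 data.frame(Word = setdiff(newdf2$Word, newdf1$Word), Count = 0))
result2  # original counts plus chance = 0 and than = 0
```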