apply function by name of list - r

Imagine that I have a list
l <- list("a" = 1, "b" = 2)
and a data frame
id value
a 3
b 4
I want to match id with list names, and apply a function on that list with the value in data frame. For example, I want the sum of value in the data frame and corresponding value in the list, I get
id value
a 4
b 6
Anyone has a clue?
Edit:
A.
I just want to expand the question a little bit with. Now, I have more than one value in every elements of list.
l <- list("a" = c(1, 2), "b" =c(1, 2))
I still want the sum
id value
a 6
b 7

We can match the names of the list with id of dataframe, unlist the list accordingly and add it to value
df$value <- unlist(l[match(df$id, names(l))]) + df$value
df
# id value
#1 a 4
#2 b 6
EDIT
If we have multiple entries in list we need to sum every list after matching. We can do
df$value <- df$value + sapply(l[match(df$id, names(l))], sum)
df
# id value
#1 a 6
#2 b 7

You just need
df$value=df$value+unlist(l)[df$id]# vector have names can just order by names
df
id value
1 a 4
2 b 6
Try answer with Ronak
l <- list("b" = 2, "a" = 1)
unlist(l)[as.character(df$id)]# if you id in df is factor
a b
1 2
Update
df$value=df$value+unlist(lapply(l,sum))[df$id]

Related

Call apply-like function on two rows to match

I have a dataframe with multiple rows. I want to call a function is using any two rows. For example, Let's say I have this data and this myFunc which accepts two args:
df <- data.frame(q1=c(1,2,5), q2=c(5,5,5), q3=c(5,2,5), q4=c(5,5,5), q5=c(2,3,1))
df
q1 q2 q3 q4 q5
1 1 5 5 5 2
2 2 5 2 5 3
3 5 5 5 5 1
myFunc<-function(a,b) sum((df[a,]==df[b,] & df[a,]==5)*1)
A want to apply myFunc for row 1 and 2, myFunc(1,2) and I expect 2, myFunc compute how many "5" are have in common under the same column, between row 1 and 2.
Since I have thousands of rows, and I want to match all pairs, I want do this without writing a for loop, maybe with the do call or apply function family.
I tried this:
a=c(1,2) # match the row 1 and 2
b=c(2,3) # match the row 2 and 3
my_list=list(a,b)
do.call("myFunc", my_list)
But I got 4, instead of 2 and 2, any ideas?
The question recently changed. My understanding of it is that the input should be a list of pairs of row numbers and the output should be the same length as that list such that each component of the output is the number of columns with both entries equal to 5 in both rows defined by the corresponding pair. Thus for df shown in the question the list L shown below would correspond to c(myFunc(1, 2), myFunc(2, 3)) where myFunc is as defined in the question.
L <- list(1:2, 2:3)
myFunc2 <- function(x) myFunc(x[1], x[2])
sapply(L, myFunc2)
## [1] 2 2
Note that *1 in myFunc is unnecessary since sum will coerce a logical argument to numeric.
An alternative might be to specify the first row numbers as a vector and the second row numbers as another vector. In terms of L that would be a <- sapply(L, "[", 1); b <- sapply(L, "[", 2). Then use mapply.
a <- c(1, 2) # L[[1]][1], L[[2]][1]
b <- c(2, 3) # L[[1]][2], L[[2]][2]
mapply(myFunc, a, b)
## [1] 2 2
Try passing the rows instead of the row index
df <- data.frame(q1=c(1,2,5), q2=c(5,5,5), q3=c(5,2,5), q4=c(5,5,5), q5=c(2,3,1))
myFunc<-function(a,b) sum((a==b & a==5)*1)
myFunc(df[1,],df[2,])
This worked for me (returned 2)

How to look for uniques in other column relatively assign ids

I have a toy example to explain what I am trying to work on :
aski = data.frame(x=c("a","b","c","a","d","d"),y=c("b","a","d","a","b","c"))
I managed to do assigning unique ids to column y and now output looks like:
aski2 = data.frame(x=c("a","b","c","a","d","d"),y=c("1","2","3","2","1","4"))
as you see "b" is present in both col x and y and we assigned an id=1 in col y
and "a" with id=2 in col y and so on..
As you see these values are also present in col x.....
col x has "a" as its first element ."a" was also in col y and assigned an id=2
so I'll assign an id=2 for a in col x also
Now what i m trying to do next is look for these values in col x and if it occurs in col y I assign that id to it
FINAL DATAFRAME LIKE
aski3 = data.frame(x=c("2","1","4","2","3","3"),y=c("1","2","3","2","1","4"))
Without the need to create aski2 as an intermediate, a possible solution is to use match with lapply to get the numeric representations of the letters:
# create a vector of the unique values in the order
# in which you want them assigned to '1' till '4'
v <- unique(aski$y)
# convert both columns to integer values with 'match' and 'lapply'
aski[] <- lapply(aski, match, v)
which gives:
> aski
x y
1 2 1
2 1 2
3 4 3
4 2 2
5 3 1
6 3 4
If you want the number as characters, you can additionally do:
aski[] <- lapply(aski, as.character)
First, convert both columns to character vectors.
Then, collect all unique values from the two columns to use as levels of a factor.
Convert both columns to factors, then numeric.
aski = data.frame(x=c("a","b","c","a","d","d"),y=c("b","a","d","a","b","c"))
aski$x <- as.character(aski$x)
aski$y <- as.character(aski$y)
lev <- unique(c(aski$y, aski$x))
aski$x <- factor(aski$x, levels=lev)
aski$y <- factor(aski$y, levels=lev)
aski$x <- as.numeric(aski$x)
aski$y <- as.numeric(aski$y)
aski
A solution from dplyr. We can first create a vector showing the relationship between index and letter as vec by unique(aski$y). After this step, you can use Jaap's lapply solution, or you can use mutata_all from dplyr as follows.
# Create the vector showing the relationship of index and letter
vec <- unique(aski$y)
# View vec
vec
[1] "b" "a" "d" "c"
library(dplyr)
# Modify all columns
aski2 <- aski %>% mutate_all(funs(match(., vec)))
# View the results
aski2
x y
1 2 1
2 1 2
3 4 3
4 2 2
5 3 1
6 3 4
Data
aski <- data.frame(x = c("a","b","c","a","d","d"),
y = c("b","a","d","a","b","c"),
stringsAsFactors = FALSE)

Removing rows in data.frame having columns subsumed in others

I am trying to achieve something similar to unique in a data.frame where column each element of a column in a row are vectors. What I want to do is if the elements of the vector in the column of that hat row a subset or equal to another remove the row with smaller number of elements. I can achieve this with a nested for loop but since data contains 400,000 rows the program is very inefficient.
Sample data
# Set the seed for reproducibility
set.seed(42)
# Create a random data frame
mydf <- data.frame(items = rep(letters[1:4], length.out = 20),
grps = sample(1:5, 20, replace = TRUE),
supergrp = sample(LETTERS[1:4], replace = TRUE))
# Aggregate items into a single column
temp <- aggregate(items ~ grps + supergrp, mydf, unique)
# Arrange by number of items for each grp and supergroup
indx <- order(lengths(temp$items), decreasing = T)
temp <- temp[indx, ,drop=FALSE]
Temp looks like
grps supergrp items
1 4 D a, c, d
2 3 D c, d
3 5 D a, d
4 1 A b
5 2 A b
6 3 A b
7 4 A b
8 5 A b
9 1 D d
10 2 D c
Now you can see that second combination of supergrp and items in second and third row is contained in first row. So, I want to delete the second and third rows from the result. Similarly, rows 5 to 8 are contained in row 4. Finally, rows 9 and 10 are contained in the first row, so I want to delete rows 9 and 10.
Hence, my result would look like:
grps supergrp items
1 4 D a, c, d
4 1 A b
My implementation is as follows::
# initialise the result dataframe by first row of old data frame
newdf <-temp[1, ]
# For all rows in the the original data
for(i in 1:nrow(temp))
{
# Index to check if all the items are found
indx <- TRUE
# Check if item in the original data appears in the new data
for(j in 1:nrow(newdf))
{
if(all(c(temp$supergrp[[i]], temp$items[[i]]) %in%
c(newdf$supergrp[[j]], newdf$items[[j]]))){
# set indx to false if a row with same items and supergroup
# as the old data is found in the new data
indx <- FALSE
}
}
# If none of the rows in new data contain items and supergroup in old data append that
if(indx){
newdf <- rbind(newdf, temp[i, ])
}
}
I believe there is an efficient way to implement this in R; may be using the tidy framework and dplyr chains but I am missing the trick. Apologies for a longish question. Any input would be highly appreciated.
I would try to get the items out of a list column and store them in a longer dataframe. Here is my somewhat hacky solution:
library(stringr)
items <- temp$items %>%
map(~str_split(., ",")) %>%
map_df(~data.frame(.))
out <- bind_cols(temp[, c("grps", "supergrp")], items)
out %>%
gather(item_name, item, -grps, -supergrp) %>%
select(-item_name, -grps) %>%
unique() %>%
filter(!is.na(item))

Finding unique tuples in R but ignoring order

Since my data is much more complicated, I made a smaller sample dataset (I left the reshape in to show how I generated the data).
set.seed(7)
x = rep(seq(2010,2014,1), each=4)
y = rep(seq(1,4,1), 5)
z = matrix(replicate(5, sample(c("A", "B", "C", "D"))))
temp_df = cbind.data.frame(x,y,z)
colnames(temp_df) = c("Year", "Rank", "ID")
head(temp_df)
require(reshape2)
dcast(temp_df, Year ~ Rank)
which results in...
> dcast(temp_df, Year ~ Rank)
Using ID as value column: use value.var to override.
Year 1 2 3 4
1 2010 D B A C
2 2011 A C D B
3 2012 A B D C
4 2013 D A C B
5 2014 C A B D
Now I essentially want to use a function like unique, but ignoring order to find where the first 3 elements are unique.
Thus in this case:
I would have A,B,C in row 5
I would have A,B,D in rows 1&3
I would have A,C,D in rows 2&4
Also I need counts of these "unique" events
Also 2 more things. First, my values are strings, and I need to leave them as strings.
Second, if possible, I would have a column between year and 1 called Weighting, and then when counting these unique combinations I would include each's weighting. This isn't as important because all weightings will be small positive integer values, so I can potentially duplicate the rows earlier to account for weighting, and then tabulate unique pairs.
You could do something like this:
df <- dcast(temp_df, Year ~ Rank)
combos <- apply(df[, 2:4], 1, function(x) paste0(sort(x), collapse = ""))
combos
# 1 2 3 4 5
# "BCD" "ABC" "ACD" "BCD" "ABC"
For each row of the data frame, the values in columns 1, 2, and 3 (as labeled in the post) are sorted using sort, then concatenated using paste0. Since order doesn't matter, this ensures that identical cases are labeled consistently.
Note that the paste0 function is equivalent to paste(..., sep = ""). The collapse argument says to concatenate the values of a vector into a single string, with vector values separated by the value passed to collapse. In this case, we're setting collapse = "", which means there will be no separation between values, resulting in "ABC", "ACD", etc.
Then you can get the count of each combination using table:
table(combos)
# ABC ACD BCD
# 2 1 2
This is the same solution as #Alex_A but using tidyverse functions:
library(purrr)
library(dplyr)
df <- dcast(temp_df, Year ~ Rank)
distinct(df, ID = pmap_chr(select(df, num_range("", 1:3)),
~paste0(sort(c(...)), collapse="")))

How do you test if a pair of elements is in a data frame?

Let's say I have this data frame A :
A = data.frame(first=c("a", "b","c", "d"), second=c(1, 2, 3, 4))
first second
1 a 1
2 b 2
3 c 3
4 d 4
And I have this data frame B :
B = data.frame(first=c("x", "a", "c"), second=c(1, 4, 3))
first second
1 x 1
2 a 4
3 c 3
I want to count the number of times a pair of the data frame B (B$first, B$second) is in the data frame A. The counting part is not the problem, I just can't find the function to determine whether a pair is in a data frame.
The result would be that only c("c",3) is an element of A, so it should be 1. both "a" and 4 are in data frame A, but the couple c("a", 4) does not exist in data frame A, so I don't want to count this. I want the exact match.
I'm looking for a function like %in% that could work for pairs.
Thanks for your help
Maybe something like this
apply(B, 1, function(r, A){ sum(A$first==r[1] & A$second==r[2]) }, A)
Basically, what it does is the following: for every row of B it applies a function that inspects which elements of A are in accordance with row r from B (part A$first==r[1] & A$second==r[2]) and then sums obtained logicals to derive the number of rows in A that are in accordance with row r.
If you also want grouping it can easily be done with dplyr like this
cbind(B,tmp) %.% group_by(first,second) %.% summarise(n=max(tmp))
where tmp is a variable representing the result of the aforementioned apply
Here's an alternative: rbind your data.frames together and use duplicated.
AB <- do.call(rbind, mget(c("A", "B")))
AB$ind <- as.numeric(duplicated(AB))
AB[grep("^B", rownames(AB)), ]
# first second ind
# B.1 x 1 0
# B.2 a 4 0
# B.3 c 3 1
You can also probably try to use "digest" to generate a hash for each row, but I'm not sure how efficient this would be:
library(digest)
Reduce(function(x, y) y %in% x,
lapply(mget(c("A", "B")), function(x)
apply(x, 1, digest)))
# [1] FALSE FALSE TRUE
An alternative is to merge by row, e.g. mB<-apply(B,1,function(j) paste0(j[1],"_",j[2]) and similarly for A at which point you can loop mB[j]%in%mA[k]
Not that I would really recommend doing this :-)

Resources