Subsetting data from a dataframe and taking specific values from the subsetted values - r

I want to check whether values (the "letters" column in the example below) in one dataframe appear in another dataframe. If they do, I want the value associated with them in the first dataframe (the "ranking" column below) to be added to the second dataframe. What I have now is the following:
Df1 <- data.frame(c("A", "C", "E"), c(1:3))
colnames(Df1) <- c("letters", "ranking")
Df2 <- data.frame(c("A", "B", "C", "D", "E"))
colnames(Df2) <- c("letters")
Df2$rank <- ifelse(Df2$letters %in% Df1$letters, 1, 0)
However... Instead of getting a '1' when the letters overlap, I want to get the specific 'ranking' number from Df1.
Thanks!

What you're looking for is called a merge:
merge(Df2, Df1, by="letters", all.x=TRUE)
Also, fun fact, you can create a dataframe and name the columns at the same time (and you'll usually want to "turn off" strings as factors):
df1 <- data.frame(
letters = c("a", "b", "c"),
ranking = 1:3,
stringsAsFactors = FALSE)
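If you would rather not merge at all, a base R lookup with match() is another option. This is only a minimal sketch on the question's example data; the column name rank is taken from the question's own attempt.

```r
# Rebuild the example data from the question
Df1 <- data.frame(letters = c("A", "C", "E"), ranking = 1:3,
                  stringsAsFactors = FALSE)
Df2 <- data.frame(letters = c("A", "B", "C", "D", "E"),
                  stringsAsFactors = FALSE)

# match() gives, for each letter in Df2, its row position in Df1 (NA if absent);
# indexing Df1$ranking with those positions pulls the matching ranking
Df2$rank <- Df1$ranking[match(Df2$letters, Df1$letters)]
Df2
```

Unlike merge(), this keeps Df2 in its original row order and leaves NA where a letter has no match in Df1.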

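The dplyr package also handles this, with left_join:
library(dplyr)
Df2 <- Df2 %>%
left_join(Df1, by = "letters")
This keeps every row of Df2 and shows NA for the letters that have no match in Df1 ("B" and "D").
Otherwise you can use semi_join:
Df2 <- Df2 %>%
semi_join(Df1, by = "letters")
This keeps only the rows of Df2 whose letters also appear in Df1 (the intersection). Note that semi_join only filters Df2; it does not add the ranking column.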
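If you want a 0 for the unmatched letters, as in the question's original ifelse() attempt, the NA values can be replaced after the join. The sketch below uses base merge() so that it runs without extra packages; the same replacement step should apply to a left_join() result.

```r
Df1 <- data.frame(letters = c("A", "C", "E"), ranking = 1:3,
                  stringsAsFactors = FALSE)
Df2 <- data.frame(letters = c("A", "B", "C", "D", "E"),
                  stringsAsFactors = FALSE)

# Left join: keep every row of Df2, NA where Df1 has no matching letter
merged <- merge(Df2, Df1, by = "letters", all.x = TRUE)

# Replace the NA rankings with 0 for letters that do not appear in Df1
merged$ranking[is.na(merged$ranking)] <- 0
merged
```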

Related

create levels for data frame column explicitly

Quite often, data frames are created from some raw data, and the factor levels of these "downstream" data frames may not be aligned. Is the following the correct way to create a level x, which may exist in another data frame with the same signature?
df <- data.frame(
c1 = c("a", "a", "b", "c")
)
df
str(df)
df$c1 <- factor(as.character(df$c1), ordered = FALSE, levels = c("c", "a", "b", "x"))
df
str(df)
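One possible way to keep the levels aligned without typing them by hand is to build the level set from both data frames and apply it to each. This is a sketch only: the second data frame, other, is hypothetical and stands in for the "other data frame with the same signature".

```r
df <- data.frame(c1 = c("a", "a", "b", "c"), stringsAsFactors = FALSE)
# Hypothetical second data frame that contains the extra level "x"
other <- data.frame(c1 = c("x", "a"), stringsAsFactors = FALSE)

# Union of all values seen in either data frame, used as the shared level set
all_levels <- union(unique(df$c1), unique(other$c1))

df$c1 <- factor(df$c1, levels = all_levels)
other$c1 <- factor(other$c1, levels = all_levels)

str(df)
table(df$c1)  # includes a count for "x" even though df has no such rows
```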

Subset a Data Frame Based on All Combinations and Sub-combinations of Factor Variables

I need to subset a data.frame based on all combinations and sub-combinations of multiple columns of factor variables. Additionally, the number of factor-variable columns may change, so the method needs to be flexible in accepting different numbers of attributes. I can figure out how to create the combinations of variables in a simple example but don't have a good way to subset the data.frame efficiently. Any thoughts?
library(data.table)
#set up an example data.table
a <- c("a", "b", "b", "b", "e")
b <- c("b", "c", "b", "b", "f")
c <- c("c", "d", "b", "b", "g")
df <- data.table(a = a, b = b, c = c)
#build a data.frame of unique combos to subset on
df_unique <- df[!duplicated(df), ]
df_combos <- data.table()
for(i in 1:ncol(df_unique)){
for(x in 1:ncol(df_unique)){
df_sub <- df_unique[,i:x, with = F]
df_combos <- rbind(df_combos, df_sub, fill = T)
}
}
df_combos <- df_combos[!duplicated(df_combos), ]
rm(df_unique)
#create a loop to build the subsets
combos_out <- data.table()
for(i in 1:nrow(df_combos)){
df_combos_sub <- df_combos[i, ]
df_combos_sub <- df_combos_sub[,which(unlist(lapply(df_combos_sub, function(x)!all(is.na(x))))),with=F]
df_sub <- merge(df, df_combos_sub, by = colnames(df_combos_sub))
#interesting code here that performs analysis on the subsets
}
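A possible alternative to the nested loops, sketched here in base R with a plain data.frame rather than data.table: enumerate every subset of columns with combn(), then let split() produce one subset per observed value combination. The names col_sets and subsets are made up for this illustration.

```r
df <- data.frame(a = c("a", "b", "b", "b", "e"),
                 b = c("b", "c", "b", "b", "f"),
                 c = c("c", "d", "b", "b", "g"),
                 stringsAsFactors = FALSE)

cols <- names(df)
# Every non-empty subset of the column names, for any number of columns
col_sets <- unlist(
  lapply(seq_along(cols), function(k) combn(cols, k, simplify = FALSE)),
  recursive = FALSE
)

subsets <- list()
for (set in col_sets) {
  # One data frame per value combination that actually occurs in these columns
  groups <- split(df, df[set], drop = TRUE)
  names(groups) <- paste(paste(set, collapse = "+"), names(groups), sep = ": ")
  subsets <- c(subsets, groups)
}

length(subsets)
subsets[["a: b"]]  # all rows where column a equals "b"
```

The analysis step can then run over subsets with lapply(). Note that this covers all column subsets, not only the contiguous ranges i:x built in the question's loop.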

Most efficient to append some columns of a data frame to some other columns

Suppose I have the following data frame:
foo <- data.frame(a=letters,b=seq(1,26),
n1=rnorm(26),n2=rnorm(26),
u1=runif(26),u2=runif(26))
I want to append columns u1 and u2 to columns n1 and n2. For now, I found the following way:
df1 <- foo[,c("a","b","n1","n2")]
df2 <- foo[,c("a","b","u1","u2")]
names(df2) <- names(df1)
bar <- rbind(df1,df2)
That does the trick. However, it seems a little bit involved. Am I too picky? Or is there a faster/simpler way to do this in R?
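Here is one way using full_join() from dplyr. It assumes df2 still has its original column names u1 and u2, i.e. before names(df2) <- names(df1) is applied: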
library(dplyr)
full_join(df1, df2, by = c("a", "b", "n1" = "u1", "n2" = "u2"))
From the documentation:
full_join
return all rows and all columns from both x and y. Where
there are not matching values, returns NA for the one missing.
by
a character vector of variables to join by. If NULL, the default,
join will do a natural join, using all variables with common names
across the two tables. A message lists the variables so that you can
check they're right.
To join by different variables on x and y use a named vector. For
example, by = c("a" = "b") will match x.a to y.b.
Use Map() to concatenate the columns, and cbind() with recycling to arrive at the final data frame.
cbind(foo[1:2], Map(c, foo[3:4], foo[5:6]))
Substitute numerical indexes with column names, if desired.
cbind(foo[c("a", "b")], Map(c, foo[c("n1", "n2")], foo[c("u1", "u2")]))
Short-hand:
rbind(foo[1:4], setNames(foo[c(1, 2, 5, 6)], names(foo[1:4])))
Long-winded:
rbind(foo[c("a", "b", "n1", "n2")], setNames(foo[c("a", "b", "u1", "u2")], c("a", "b", "n1", "n2")))
Long-winded (more DRY):
nms <- c("a", "b", "n1", "n2")
rbind(foo[nms], setNames(foo[c("a", "b", "u1", "u2")], nms))
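As a quick sanity check, here is a small sketch comparing the Map()/cbind() variant with the rbind()/setNames() variant on the question's foo. The set.seed() call and the names via_rbind and via_map are additions for this illustration only.

```r
set.seed(1)  # only so the random columns are reproducible
foo <- data.frame(a = letters, b = seq(1, 26),
                  n1 = rnorm(26), n2 = rnorm(26),
                  u1 = runif(26), u2 = runif(26))

nms <- c("a", "b", "n1", "n2")
via_rbind <- rbind(foo[nms], setNames(foo[c("a", "b", "u1", "u2")], nms))
via_map <- cbind(foo[c("a", "b")],
                 Map(c, foo[c("n1", "n2")], foo[c("u1", "u2")]))

# Reset row names so that only the contents are compared
rownames(via_rbind) <- NULL
rownames(via_map) <- NULL

all.equal(via_rbind, via_map)
```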

Preserving zero length groups with aggregate

I just noticed that aggregate drops empty groups from the result. How can I solve this? E.g.
xx <- c("a", "b", "d", "a", "d", "a")
xx <- factor(xx, levels = c("a", "b", "c", "d"))
y <- rnorm(60, 5, 1)
z <- matrix(y, 6, 10)
aggregate(z, by = list(groups = xx), sum)
xx is a factor variable with 4 levels, but the result has just 3 rows, and I would like a row for the "c" level with zeros. I would like the same behavior as table(xx), which gives frequencies even for levels with no observations.
We can create another data.frame with just the levels of 'xx' and then merge with the aggregate. The output will have all the 'groups' while the row corresponding to the missing level for the other columns will be NA.
merge(data.frame(groups=levels(xx)),
aggregate(z, by = list(groups = xx), sum), all.x=TRUE)
Another option might be to convert to 'long' format with melt and then use dcast with fun.aggregate as 'sum' and drop=FALSE
library(data.table)
dcast(melt(data.table(groups=xx, z), id.var='groups'),
groups~variable, value.var='value', sum, drop=FALSE)
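Since the question asks for zeros rather than NA in the row for the empty level, the merge() result can be post-processed. A minimal base R sketch follows; set.seed() and the names agg and full are added here for illustration.

```r
set.seed(42)  # only for reproducibility
xx <- factor(c("a", "b", "d", "a", "d", "a"),
             levels = c("a", "b", "c", "d"))
z <- matrix(rnorm(60, 5, 1), 6, 10)

agg <- aggregate(z, by = list(groups = xx), sum)

# Left join against the full set of levels; the "c" row comes back as NA
full <- merge(data.frame(groups = levels(xx)), agg, all.x = TRUE)

# Replace the NA sums of the empty group with 0
full[is.na(full)] <- 0
full
```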

Select and count the number of duplicate items with two different outcome values?

Long-time follower, thanks so much for all your help over the years! I have a question that might have an easy answer, but I failed in googling it, and trying various subsetting and bracket notation also fell short. I'm betting someone here has encountered a similar problem.
I have a long-form data set with a set of duplicate ids. I also have a third variable that might be different for the duplicate. By example, if you recreate my data set:
x <- c("a", "a", "b", "c", "c", "d", "d", "d")
y <- c("z", "z", "z", "y", "y", "y", "x", "x")
z <- c(10, 20, 10, 10, 10, 10, 10, 20)
df <- cbind(x, y, z)
df <- as.data.frame(df)
names(df) <- c("id1", "id2", "var1")
df
I want to select the rows in which id2 has BOTH a 10 and a 20 when they are connected to the same id1. For example, 'z' has two observations connected to id1 'a' with two different var1 values (a '10' and a '20').
I want to select these cases, as well as count how many cases like this are in the overall data set. Thanks in advance!
One way is with ddply from the plyr package. Something like this:
> library(plyr)
> ddply(df, c('id2', 'id1'), function(x) if(length(unique(x$var1))==2) x)
id1 id2 var1
1 d x 10
2 d x 20
3 a z 10
4 a z 20
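The question also asks how many such cases there are. A base R sketch that selects the rows and counts the distinct id1/id2 pairs could look like the following. It builds df directly with a numeric var1 (rather than through cbind(), which turns everything into character), and the names n_values, selected and n_cases are made up for this example.

```r
df <- data.frame(
  id1 = c("a", "a", "b", "c", "c", "d", "d", "d"),
  id2 = c("z", "z", "z", "y", "y", "y", "x", "x"),
  var1 = c(10, 20, 10, 10, 10, 10, 10, 20),
  stringsAsFactors = FALSE
)

# For every row, the number of distinct var1 values within its id1/id2 pair
n_values <- ave(df$var1, df$id1, df$id2,
                FUN = function(v) length(unique(v)))

# Rows belonging to pairs that carry two different var1 values
selected <- df[n_values == 2, ]
selected

# Number of distinct id1/id2 pairs with this property
n_cases <- nrow(unique(selected[c("id1", "id2")]))
n_cases
```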
