Is there a lookup function in R like VLOOKUP in Excel? [duplicate]
I have a dataset of about 105,000 rows and 30 columns. I have a categorical variable that I would like to map to a number. In Excel, I would probably do something with VLOOKUP and fill.
How would I go about doing the same thing in R?
Essentially, what I have is a HouseType variable, and I need to calculate the HouseTypeNo. Here are some sample data:
HouseType HouseTypeNo
Semi 1
Single 2
Row 3
Single 2
Apartment 4
Apartment 4
Row 3
If I understand your question correctly, here are four methods to do the equivalent of Excel's VLOOKUP and fill down using R:
# load sample data from Q
hous <- read.table(header = TRUE,
stringsAsFactors = FALSE,
text="HouseType HouseTypeNo
Semi 1
Single 2
Row 3
Single 2
Apartment 4
Apartment 4
Row 3")
# create a toy large table with a 'HouseType' column
# but no 'HouseTypeNo' column (yet)
largetable <- data.frame(HouseType = as.character(sample(unique(hous$HouseType), 1000, replace = TRUE)), stringsAsFactors = FALSE)
# create a lookup table to get the numbers to fill
# the large table
lookup <- unique(hous)
HouseType HouseTypeNo
1 Semi 1
2 Single 2
3 Row 3
5 Apartment 4
Here are four methods to fill the HouseTypeNo in the largetable using the values in the lookup table:
First, with merge in base R:
# 1. using base
base1 <- merge(lookup, largetable, by = 'HouseType')
A second method with named vectors in base:
# 2. using base and a named vector
housenames <- as.numeric(1:length(unique(hous$HouseType)))
names(housenames) <- unique(hous$HouseType)
base2 <- data.frame(HouseType = largetable$HouseType,
HouseTypeNo = (housenames[largetable$HouseType]))
Third, using the plyr package:
# 3. using the plyr package
library(plyr)
plyr1 <- join(largetable, lookup, by = "HouseType")
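(plyr is now retired; if you prefer the tidyverse, a dplyr equivalent would be a left join. This is a sketch, assuming dplyr is installed:)
# 3b. the dplyr equivalent of the plyr join
library(dplyr)
plyr2 <- left_join(largetable, lookup, by = "HouseType")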
Fourth, using the sqldf package:
# 4. using the sqldf package
library(sqldf)
sqldf1 <- sqldf("SELECT largetable.HouseType, lookup.HouseTypeNo
FROM largetable
INNER JOIN lookup
ON largetable.HouseType = lookup.HouseType")
If it's possible that some house types in largetable do not exist in lookup, then a left join would be used:
sqldf("select * from largetable left join lookup using (HouseType)")
Corresponding changes to the other solutions would be needed too.
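For example, the merge-based method #1 becomes a left join by adding all.x = TRUE, so unmatched house types get NA for HouseTypeNo (a sketch):
# 1b. base merge as a left join
base1_left <- merge(largetable, lookup, by = 'HouseType', all.x = TRUE)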
Is that what you wanted to do? Let me know which method you like and I'll add commentary.
I think you can also use match():
largetable$HouseTypeNo <- with(lookup,
HouseTypeNo[match(largetable$HouseType,
HouseType)])
This still works if I scramble the order of lookup.
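For instance (a quick sketch), shuffle the lookup rows and the result is unchanged, because match() finds each type by value rather than by position:
lookup_shuffled <- lookup[sample(nrow(lookup)), ]
largetable$HouseTypeNo <- with(lookup_shuffled,
                               HouseTypeNo[match(largetable$HouseType, HouseType)])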
I also like using qdapTools::lookup or the shorthand binary operator %l%. It works like an Excel VLOOKUP, but it accepts name arguments as opposed to column numbers:
## Replicate Ben's data:
hous <- structure(list(HouseType = c("Semi", "Single", "Row", "Single",
"Apartment", "Apartment", "Row"), HouseTypeNo = c(1L, 2L, 3L,
2L, 4L, 4L, 3L)), .Names = c("HouseType", "HouseTypeNo"),
class = "data.frame", row.names = c(NA, -7L))
largetable <- data.frame(HouseType = as.character(sample(unique(hous$HouseType),
1000, replace = TRUE)), stringsAsFactors = FALSE)
## It's this simple:
library(qdapTools)
largetable[, 1] %l% hous
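To store the result as a new column (a sketch):
largetable$HouseTypeNo <- largetable[, 1] %l% hous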
The poster didn't ask about approximate (inexact) lookups, but I'm adding this as an answer for my own reference and possibly others.
If you're looking up categorical values, use the other answers.
Excel's VLOOKUP also allows you to match approximately for numeric values with the fourth argument(1), match=TRUE. I think of match=TRUE like looking up values on a thermometer. For categorical values you want an exact match, i.e. FALSE (worth passing explicitly, since in Excel this argument actually defaults to TRUE).
If you want to match approximately, R has a function called findInterval, which (as the name implies) will find the interval / bin that contains your continuous numeric value.
However, let's say that you want to run findInterval for several values. You could write a loop or use an apply function, but I've found it more efficient to take a DIY vectorized approach.
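For reference, here is what a minimal findInterval call looks like on a toy vector of breakpoints (breaks is made up for illustration):
breaks <- c(0, 10, 20, 30)
findInterval(c(5, 12, 25), breaks)
# [1] 1 2 3   (the values fall in bins 1, 2 and 3)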
Let's say that you have a grid of values indexed by x and y:
grid <- list(x = c(-87.727, -87.723, -87.719, -87.715, -87.711),
y = c(41.836, 41.839, 41.843, 41.847, 41.851),
z = (matrix(data = c(-3.428, -3.722, -3.061, -2.554, -2.362,
-3.034, -3.925, -3.639, -3.357, -3.283,
-0.152, -1.688, -2.765, -3.084, -2.742,
1.973, 1.193, -0.354, -1.682, -1.803,
0.998, 2.863, 3.224, 1.541, -0.044),
nrow = 5, ncol = 5)))
and you have some values you want to look up by x and y:
df <- data.frame(x = c(-87.723, -87.712, -87.726, -87.719, -87.722, -87.722),
                 y = c(41.84, 41.842, 41.844, 41.849, 41.838, 41.842),
                 id = c("a", "b", "c", "d", "e", "f"))
Here is the example visualized:
contour(grid)
points(df$x, df$y, pch=df$id, col="blue", cex=1.2)
You can find the x intervals and y intervals with this type of formula:
xrng <- range(grid$x)
xbins <- length(grid$x) - 1
yrng <- range(grid$y)
ybins <- length(grid$y) - 1
# rescale each coordinate to [0, bins], truncate down to a bin index,
# then shift to 1-based indexing (assumes evenly spaced grid points)
df$ix <- trunc((df$x - min(xrng)) / diff(xrng) * xbins) + 1
df$iy <- trunc((df$y - min(yrng)) / diff(yrng) * ybins) + 1
You could take it one step further and perform a (simplistic) interpolation on the z values in grid like this:
df$z <- with(df, (grid$z[cbind(ix, iy)] +            # lower-left corner
                  grid$z[cbind(ix + 1, iy)] +        # lower-right
                  grid$z[cbind(ix, iy + 1)] +        # upper-left
                  grid$z[cbind(ix + 1, iy + 1)]) / 4)  # upper-right; average all four
Which gives you these values:
contour(grid, xlim = range(c(grid$x, df$x)), ylim = range(c(grid$y, df$y)))
points(df$x, df$y, pch=df$id, col="blue", cex=1.2)
text(df$x + .001, df$y, lab=round(df$z, 2), col="blue", cex=1)
df
# x y id ix iy z
# 1 -87.723 41.840 a 2 2 -3.00425
# 2 -87.712 41.842 b 4 2 -3.11650
# 3 -87.726 41.844 c 1 3 0.33150
# 4 -87.719 41.849 d 3 4 0.68225
# 5 -87.722 41.838 e 2 1 -3.58675
# 6 -87.722 41.842 f 2 2 -3.00425
Note that ix and iy could also have been found with a loop using findInterval; e.g., here's an example for the second row:
findInterval(df$x[2], grid$x)
# 4
findInterval(df$y[2], grid$y)
# 2
which matches ix and iy in df[2, ].
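Since findInterval is vectorized over its first argument, both index columns can also be computed in single calls, with no explicit loop (a sketch; unlike the trunc formula above, this uses the actual breakpoints, so it would also handle unevenly spaced grids):
df$ix <- findInterval(df$x, grid$x)
df$iy <- findInterval(df$y, grid$y)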
Footnote:
(1) The fourth argument of vlookup was previously called "match", but after they introduced the ribbon it was renamed to "[range_lookup]".
Solution #2 of @Ben's answer is not reproducible in other, more generic examples. It happens to give the correct lookup here only because the unique HouseType values in hous appear in increasing order. Try this:
hous <- read.table(header = TRUE, stringsAsFactors = FALSE, text="HouseType HouseTypeNo
Semi 1
ECIIsHome 17
Single 2
Row 3
Single 2
Apartment 4
Apartment 4
Row 3")
largetable <- data.frame(HouseType = as.character(sample(unique(hous$HouseType), 1000, replace = TRUE)), stringsAsFactors = FALSE)
lookup <- unique(hous)
Ben's solution #2 gives:
housenames <- as.numeric(1:length(unique(hous$HouseType)))
names(housenames) <- unique(hous$HouseType)
base2 <- data.frame(HouseType = largetable$HouseType,
HouseTypeNo = (housenames[largetable$HouseType]))
which, when checked, gives
unique(base2$HouseTypeNo[ base2$HouseType=="ECIIsHome" ])
[1] 2
when the correct answer is 17 from the lookup table.
The correct way to do it is
hous <- read.table(header = TRUE, stringsAsFactors = FALSE, text="HouseType HouseTypeNo
Semi 1
ECIIsHome 17
Single 2
Row 3
Single 2
Apartment 4
Apartment 4
Row 3")
largetable <- data.frame(HouseType = as.character(sample(unique(hous$HouseType), 1000, replace = TRUE)), stringsAsFactors = FALSE)
housenames <- tapply(hous$HouseTypeNo, hous$HouseType, unique)
base2 <- data.frame(HouseType = largetable$HouseType,
HouseTypeNo = (housenames[largetable$HouseType]))
Now the lookups are performed correctly
unique(base2$HouseTypeNo[ base2$HouseType=="ECIIsHome" ])
ECIIsHome
17
I tried to edit Ben's answer but it was rejected for reasons I cannot understand.
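As an aside, a lighter-weight way to build the same robust named vector is base R's setNames on the de-duplicated table (a sketch):
lookup <- unique(hous)
housenames <- setNames(lookup$HouseTypeNo, lookup$HouseType)
housenames["ECIIsHome"]
# ECIIsHome
#        17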
Starting with:
houses <- read.table(text="Semi 1
Single 2
Row 3
Single 2
Apartment 4
Apartment 4
Row 3",col.names=c("HouseType","HouseTypeNo"))
... you can use
as.numeric(factor(houses$HouseType))
... to give a unique number for each house type. You can see the result here:
> houses2 <- data.frame(houses,as.numeric(factor(houses$HouseType)))
> houses2
HouseType HouseTypeNo as.numeric.factor.houses.HouseType..
1 Semi 1 3
2 Single 2 4
3 Row 3 2
4 Single 2 4
5 Apartment 4 1
6 Apartment 4 1
7 Row 3 2
... so you end up with different numbers on the rows (because the factor levels are ordered alphabetically) but the same pattern.
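If you need the exact numbering from the question, one option (a sketch; HouseTypeNo2 is just an illustrative name) is to fix the level order explicitly:
houses$HouseTypeNo2 <- as.numeric(factor(houses$HouseType,
                                         levels = c("Semi", "Single", "Row", "Apartment")))
# Semi = 1, Single = 2, Row = 3, Apartment = 4, matching HouseTypeNo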
(EDIT: the remaining text in this answer is actually redundant. It occurred to me to check and it turned out that read.table() had already made houses$HouseType into a factor when it was read into the dataframe in the first place).
However, you may well be better just to convert HouseType to a factor, which would give you all the same benefits as HouseTypeNo, but would be easier to interpret because the house types are named rather than numbered, e.g.:
> houses3 <- houses
> houses3$HouseType <- factor(houses3$HouseType)
> houses3
HouseType HouseTypeNo
1 Semi 1
2 Single 2
3 Row 3
4 Single 2
5 Apartment 4
6 Apartment 4
7 Row 3
> levels(houses3$HouseType)
[1] "Apartment" "Row" "Semi" "Single"
You could use mapvalues() from the plyr package.
Initial data:
dat <- data.frame(HouseType = c("Semi", "Single", "Row", "Single", "Apartment", "Apartment", "Row"))
> dat
HouseType
1 Semi
2 Single
3 Row
4 Single
5 Apartment
6 Apartment
7 Row
Lookup / crosswalk table:
lookup <- data.frame(type_text = c("Semi", "Single", "Row", "Apartment"), type_num = c(1, 2, 3, 4))
> lookup
type_text type_num
1 Semi 1
2 Single 2
3 Row 3
4 Apartment 4
Create the new variable:
dat$house_type_num <- plyr::mapvalues(dat$HouseType, from = lookup$type_text, to = lookup$type_num)
Or for simple replacements you can skip creating a long lookup table and do this directly in one step:
dat$house_type_num <- plyr::mapvalues(dat$HouseType,
from = c("Semi", "Single", "Row", "Apartment"),
to = c(1, 2, 3, 4))
Result:
> dat
HouseType house_type_num
1 Semi 1
2 Single 2
3 Row 3
4 Single 2
5 Apartment 4
6 Apartment 4
7 Row 3
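One caveat (this assumes HouseType came in as a factor, as data.frame() made it by default before R 4.0): mapvalues() then returns a factor with relabeled levels, so coerce if you need a true numeric column:
dat$house_type_num <- as.numeric(as.character(dat$house_type_num))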
Using merge is different from a lookup in Excel: it has the potential to duplicate (multiply) your data if the primary-key constraint is not enforced in the lookup table, and to reduce the number of records if you are not using all.x = TRUE.
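Here is a quick illustration of the duplication risk with toy data (dup_lookup is made up for this sketch):
dup_lookup <- data.frame(HouseType = c("Row", "Row", "Semi"),   # key accidentally duplicated
                         HouseTypeNo = c(3, 99, 1))
nrow(merge(data.frame(HouseType = c("Row", "Semi")), dup_lookup))
# [1] 3   -- two input rows became three, because "Row" matched twice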
To make sure you don't get into trouble with that and can look up safely, I suggest two strategies.
The first is to check for duplicated rows in the lookup key:
safeLookup <- function(data, lookup, by, select = setdiff(colnames(lookup), by)) {
  # Merges data to lookup, making sure that the number of rows does not change.
  stopifnot(sum(duplicated(lookup[, by])) == 0)
  res <- merge(data, lookup[, c(by, select)], by = by, all.x = TRUE)
  return(res)
}
This will force you to de-dupe the lookup dataset before using it:
baseSafe <- safeLookup(largetable, house.ids, by = "HouseType")
# Error: sum(duplicated(lookup[, by])) == 0 is not TRUE
baseSafe<- safeLookup(largetable, unique(house.ids), by = "HouseType")
head(baseSafe)
# HouseType HouseTypeNo
# 1 Apartment 4
# 2 Apartment 4
# ...
The second option is to reproduce Excel's behaviour by taking the first matching value from the lookup dataset:
firstLookup <- function(data, lookup, by, select = setdiff(colnames(lookup), by)) {
  # Merges data to lookup using the first row per unique combination in by.
  unique.lookup <- lookup[!duplicated(lookup[, by]), ]
  res <- merge(data, unique.lookup[, c(by, select)], by = by, all.x = TRUE)
  return(res)
}
baseFirst <- firstLookup(largetable, house.ids, by = "HouseType")
These functions differ slightly from a plain lookup in that they can add multiple columns at once.
The lookup package can be used here:
library(lookup)
# reference data
hous <- data.frame(HouseType=c("Semi","Single","Row","Single","Apartment","Apartment","Row"),
HouseTypeNo=c(1,2,3,2,4,4,3))
# new large data with HouseType but no HouseTypeNo
largetable <- data.frame(HouseType = sample(unique(hous$HouseType), 1000, replace = TRUE))
# vector approach
largetable$num1 <- lookup(largetable$HouseType, hous$HouseType, hous$HouseTypeNo)
# dataframe approach
largetable$num2 <- vlookup(largetable$HouseType, hous, "HouseType", "HouseTypeNo")
head(largetable)
# HouseType num1 num2
# 1 Semi 1 1
# 2 Semi 1 1
# 3 Apartment 4 4
# 4 Semi 1 1
# 5 Single 2 2
# 6 Single 2 2