Look up from different dataframes depending on a column - r

Supposing I have the following dataframes:
d1 <- data.frame(index = c(1,2,3,4), location = c('barn', 'house', 'restaurant', 'tomb'), random = c(5,3,2,1), different_col1 = c(66,33,22,11))
d2 <- data.frame(index = c(1,2,3,4), location = c('server', 'computer', 'home', 'dictionary'), random = c(1,7,2,9), differen_col2 = c('hi', 'there', 'different', 'column'))
What I am trying to do is get the location based on the index and what dataframe it is. So I have the following:
data <- data.frame(src = c('one', 'one', 'two', 'one', 'two'), index = c(1,4,2,3,2))
Here src indicates which dataframe the data should come from, and index is the value to look up in that dataframe's index column.
src | index
-------------
one | 1
one | 4
two | 2
one | 3
two | 2
And I would like it to become:
src | index | location
-----------------------
one | 1 | barn
one | 4 | tomb
two | 2 | computer
one | 3 | restaurant
two | 2 | computer
Due to the size of my data I would like to avoid merge or comparable joins (sqldf, etc).

Here's one way to add a new column by reference using data.table:
require(data.table)
setDT(d1); setDT(d2); setDT(data) # convert all data.frames to data.tables
data[src == "one", location := d1[.SD, location, on="index"]]
data[src == "two", location := d2[.SD, location, on="index"]]
.SD stands for Subset of Data. Here it contains the rows of data that match the condition provided in the i-argument, with all of their columns.
See the vignettes for more.
Instead of extracting location with a join, you can also use match in the expression to the right of :=. However, that would not extend easily if you wanted to match on multiple columns.
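As a rough illustration of the match idea mentioned above, here is a minimal base R sketch (not the data.table syntax itself). The list name lookups is hypothetical; it simply maps each src label to its dataframe, and the sample data is rebuilt with stringsAsFactors = FALSE so that location stays character.

```r
# Sample data from the question, with character (not factor) columns
d1 <- data.frame(index = c(1, 2, 3, 4),
                 location = c("barn", "house", "restaurant", "tomb"),
                 stringsAsFactors = FALSE)
d2 <- data.frame(index = c(1, 2, 3, 4),
                 location = c("server", "computer", "home", "dictionary"),
                 stringsAsFactors = FALSE)
data <- data.frame(src = c("one", "one", "two", "one", "two"),
                   index = c(1, 4, 2, 3, 2),
                   stringsAsFactors = FALSE)

# Hypothetical helper list: one lookup table per value of src
lookups <- list(one = d1, two = d2)

data$location <- NA_character_
for (s in names(lookups)) {
  rows <- data$src == s
  tab <- lookups[[s]]
  # match() returns, for each index in data, the row position in the lookup table
  data$location[rows] <- tab$location[match(data$index[rows], tab$index)]
}
data
```

Unlike indexing by row number, match looks the value up in the index column, so this should still work when index values are not 1, 2, 3, ... in row order.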

library(dplyr)
mutate(data,
location = ifelse(src == "one",
as.character(d1[index, "location"]),
as.character(d2[index, "location"])))
output
src index location
1 one 1 barn
2 one 4 tomb
3 two 2 computer
4 one 3 restaurant
5 two 2 computer

data.table will help you deal with large data much more efficiently.
You could either use match or data.table's own implementation of merge, which is much faster than the base merge in my original solution, as we discussed in the comments.
Here's an example:
require(data.table)
d1 <- data.frame(index = c(1,2,3,4), location = c('barn', 'house', 'restaurant', 'tomb'), random = c(5,3,2,1), different_col1 = c(66,33,22,11))
d2 <- data.frame(index = c(1,2,3,4), location = c('server', 'computer', 'home', 'dictionary'), random = c(1,7,2,9), differen_col2 = c('hi', 'there', 'different', 'column'))
mydata <- data.table(src = c('one', 'one', 'two', 'one', 'two'), index = c(1,4,2,3,2))
mydata.d1 <- mydata[mydata$src == "one",]
mydata.d2 <- mydata[mydata$src == "two",]
mydata.d1 <- merge(mydata.d1, d1, all.x = TRUE, by = "index")
mydata.d2 <- merge(mydata.d2, d2, all.x = TRUE, by = "index")
# If you want to keep the 'different column' values from d1 and d2:
mydata <- rbind(mydata.d1, mydata.d2, fill = TRUE)
mydata
index src location random different_col1 differen_col2
1: 1 one barn 5 66 NA
2: 3 one restaurant 2 22 NA
3: 4 one tomb 1 11 NA
4: 2 two computer 7 NA there
5: 2 two computer 7 NA there
# If you don't want to keep those 'different column' values:
mydata <- rbind(mydata.d1[,.(index, src, location)], mydata.d2[,.(index, src, location)])
mydata
index src location
1: 1 one barn
2: 3 one restaurant
3: 4 one tomb
4: 2 two computer
5: 2 two computer

Base solution: use a character index to choose the correct dataframe, and then use mapply to handle the multiple "parallel" arguments.
dput(dat)
structure(list(src = c("one", "one", "two", "one", "two"), X. = c("|",
"|", "|", "|", "|"), index = c(1L, 4L, 2L, 3L, 2L), location = structure(c(1L,
4L, 5L, 3L, 5L), .Label = c("barn", "house", "restaurant", "tomb",
"computer", "dictionary", "home", "server"), class = "factor")), .Names = c("src",
"X.", "index", "location"), row.names = c(NA, -5L), class = "data.frame")
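You may need stringsAsFactors = FALSE when building the data so that src is a character vector. Here dlist is taken to be a named list of the two lookup dataframes:
dlist <- list(one = d1, two = d2)
dat$location <- mapply(function(whichd,i) dlist[[whichd]][i,'location'], whichd=dat$src, i=dat$index)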
> dat
src X. index location
1 one | 1 barn
2 one | 4 tomb
3 two | 2 computer
4 one | 3 restaurant
5 two | 2 computer
>

Related

Creating a function to summarise the paired "Factors" which are currently stored in 2 columns of a dataframe

I have a large (>200000 observations) flat file dataframe which has multiple "paired" codes throughout it. For each pair, one column contains a numerical code, the second is a description of the code. I have set both the codes and descriptions to "factors".
An example of the dataframe is below
|-------------|---------------|---------------|-------------|---------------|---------
| ID | Unit_CD | Unit | Name_CD | Name | etc
|-------------|---------------|---------------|-------------|---------------|---------
| 01 | 12 | Bob | A01 | EPID | etc
| 02 | 10 | Sue | A04 | UPIM | etc
| 03 | 12 | Bob | V03 | AVRM | etc
| 04 | 14 | Moo | A04 | UPIM | etc
I would like to create a function where you can input the 2 paired column names and it will return a concatenated field which displays the numeric code and the description as per below:
'code.names(df,Unit_CD,Unit)'
OUTPUT:
Unit Codes
10: Sue
12: Bob
14: Moo
I have written the following code however I can not get it to accept the column names as an input to the function:
code.names <- function(df,column1, column2){
n <-count(df,column1,column2)
CD.V <- as.vector(n[,1])
CD.Code <- as.vector(n[,2])
i <- nrows(n)
for (i in 1:n){
paste(CD.V[i],CD.Code[i])
}
}
The error I am getting is
Error: Must group by variables found in `.data`.
* Column `column1` is not found.
* Column `column2` is not found.
As I am doing this multiple times through the code, I would prefer to set this up as a function, however any other method of achieving my end goal would still be appreciated.
Hope the code below works for your goal
code.names <- function(df,column1, column2) unique(paste0(df[[column1]],":",df[[column2]]))
An option with unite
library(dplyr)
library(tidyr)
df %>%
distinct(Unit_CD, Unit) %>%
unite(New, Unit_CD, Unit, sep=": ")
# New
#1 12: Bob
#2 10: Sue
#3 14: Moo
data
df <- structure(list(ID = 1:4, Unit_CD = c(12L, 10L, 12L, 14L), Unit = c("Bob",
"Sue", "Bob", "Moo"), Name_CD = c("A01", "A04", "V03", "A04"),
Name = c("EPID", "UPIM", "AVRM", "UPIM")), class = "data.frame",
row.names = c(NA,
-4L))
I would suggest this approach with a new function using your data and the names of the columns to be concatenated:
#Data
df <- structure(list(ID = 1:4, Unit_CD = c(12L, 10L, 12L, 14L), Unit = c("Bob",
"Sue", "Bob", "Moo"), Name_CD = c("A01", "A04", "V03", "A04"),
Name = c("EPID", "UPIM", "AVRM", "UPIM")), class = "data.frame", row.names = c(NA,
-4L))
Code:
#Function
myfun <- function(x,cola,colb)
{
var <- paste0(x[,cola],': ',x[,colb])
var <- unique(var)
var <- data.frame(var)
return(var)
}
#Apply
myfun(df,'Unit_CD', 'Unit')
Output:
var
1 12: Bob
2 10: Sue
3 14: Moo
You can use duplicated to keep only unique values in the dataframe.
code.names <- function(df,column1, column2) {
df1 <- df[!duplicated(df[c(column1, column2)]), ]
cat(paste(df1[[column1]], df1[[column2]], sep = ":", collapse = "\n"))
}
code.names(df, 'Unit_CD','Unit')
#12:Bob
#10:Sue
#14:Moo
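The answers above return the pairs in order of first appearance, while the desired output in the question is sorted by code and uses ": " as the separator. A minimal base R sketch of that variant follows; the function name code_names and its argument names are hypothetical, and the column names are passed as strings.

```r
df <- data.frame(ID = 1:4,
                 Unit_CD = c(12L, 10L, 12L, 14L),
                 Unit = c("Bob", "Sue", "Bob", "Moo"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: unique code/description pairs, sorted by code
code_names <- function(df, code_col, desc_col) {
  pairs <- unique(df[, c(code_col, desc_col)])
  pairs <- pairs[order(pairs[[code_col]]), ]
  paste0(pairs[[code_col]], ": ", pairs[[desc_col]])
}

res <- code_names(df, "Unit_CD", "Unit")
res
# "10: Sue" "12: Bob" "14: Moo"
```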

Splitting column with differing syntax in R

I am having some trouble cleaning up my data. It consists of a list of sold houses. It is made up of the sell price, no. of rooms, m2 and the address.
As seen below the address is in one string.
Head(DF, 3)
Address Price m2 Rooms
Petersvej 1772900 Hoersholm 10.000 210 5
Annasvej 2B2900 Hoersholm 15.000 230 4
Krænsvej 125800 Lyngby C 10.000 210 5
A Mivs Alle 119800 Hjoerring 1.300 70 3
The syntax for the address column is: road name, road no., followed by a 4-digit postal code and the city name (sometimes two words).
I also need to extract the postal code. I have been looking at the 'stringi' package but haven't been able to find any examples.
Any pointers are very much appreciated.
1) Use separate from tidyr to split Address into three fields, merging anything left over into the last one, and then use separate again to split the last 4 digits off the Number column generated by the first separate.
library(dplyr)
library(tidyr)
DF %>%
separate(Address, into = c("Road", "Number", "City"), extra = "merge") %>%
separate(Number, into = c("StreetNo", "Postal"), sep = -4)
giving:
Road StreetNo Postal City Price m2 Rooms CITY
1 Petersvej 77 2900 Hoersholm 10 210 5 Hoersholm
2 Annasvej 121B 2900 Hoersholm 15 230 4 Hoersholm
3 Krænsvej 12 5800 Lyngby C 10 210 5 C
2) Alternately, insert commas between the subfields of Address and then use separate to split the subfields out. It gives the same result as (1) on the input shown in the Note below.
DF %>%
mutate(Address = sub("(\\S.*) +(\\S+)(\\d{4}) +(.*)", "\\1,\\2,\\3,\\4", Address)) %>%
separate(Address, into = c("Road", "Number", "Postal", "City"), sep = ",")
Note
The input DF in reproducible form is:
DF <-
structure(list(Address = structure(c(3L, 1L, 2L), .Label = c("Annasvej 121B2900 Hoersholm",
"Krænsvej 125800 Lyngby C", "Petersvej 772900 Hoersholm"), class = "factor"),
Price = c(10, 15, 10), m2 = c(210L, 230L, 210L), Rooms = c(5L,
4L, 5L), CITY = structure(c(2L, 2L, 1L), .Label = c("C",
"Hoersholm"), class = "factor")), class = "data.frame", row.names = c(NA,
-3L))
Check out the cSplit function from the splitstackshape package
library(splitstackshape)
df_new <- cSplit(df, splitCols = "Address", sep = " ")
#This will split your address column into 4 different columns split at the space
#you can then add an ifelse block to combine the last 2 columns to make up the city like
df_new$City <- ifelse(is.na(df_new$Address_4), as.character(df_new$Address_3), paste(df_new$Address_3, df_new$Address_4, sep = " "))
One way to do this is with regex.
In this instance you may use a simple regular expression which will match all alphabetical characters and space characters which lead to the end of the string, then trim the whitespace off.
library(stringr)
DF <- data.frame(Address=c("Petersvej 772900 Hoersholm",
"Annasvej 121B2900 Hoersholm",
"Krænsvej 125800 Lyngby C"))
DF$CITY <- str_trim(str_extract(DF$Address, "[a-zA-Z ]+$"))
This will give you the following output:
Address CITY
1 Petersvej 772900 Hoersholm Hoersholm
2 Annasvej 121B2900 Hoersholm Hoersholm
3 Krænsvej 125800 Lyngby C Lyngby C
In R, the stringr package is convenient for regex because functions such as str_match return every capture group, which in this example would let you separate each component of the address with one expression.
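To make the multiple-group idea concrete, here is a minimal sketch using base R's regexec and regmatches, which also return every capture group (stringr's str_match gives a comparable matrix). The pattern is adapted from the sub() call in answer (2) above, and the column names are just labels chosen for illustration. It assumes every address follows the stated layout; rows that do not match would be dropped silently.

```r
addresses <- c("Petersvej 772900 Hoersholm",
               "Annasvej 121B2900 Hoersholm",
               "Krænsvej 125800 Lyngby C")

# Groups: road name, road number, 4-digit postal code, city
pattern <- "^(\\S.*) +(\\S+)(\\d{4}) +(.*)$"

# regmatches() returns, per address, the full match followed by the groups
m <- regmatches(addresses, regexec(pattern, addresses))

# Bind into a matrix and drop the first column (the full match)
parts <- do.call(rbind, m)[, -1, drop = FALSE]
colnames(parts) <- c("Road", "Number", "Postal", "City")
parts <- as.data.frame(parts, stringsAsFactors = FALSE)
parts
```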

Checking if keyword in one table is within a string in another table using R

I've been trying to solve this issue with mapply, but I believe I will have to use several nested applies to make this work, and it has gotten real confusing.
The problem is as follows:
Dataframe one contains around 400 keywords. These fall into roughly 15 categories.
Dataframe two contains a string description field, and 15 additional columns, each named to correspond to the categories mentioned in dataframe one. This has millions of rows.
If a keyword from dataframe 1 exists in the string field in dataframe 2, the category in which the keyword exists should be flagged in dataframe 2.
What I want should look something like this:
> #Dataframe1 df1
>> keyword category
>> cat A
>> dog A
>> pig A
>> crow B
>> pigeon B
>> hawk B
>> catfish C
>> carp C
>> ...
>>
> #Dataframe2 df2
>> description A B C ....
>> false cat 1 0 0 ....
>> smiling pig 1 0 0 ....
>> shady pigeon 0 1 0 ....
>> dogged dog 2 0 0 ....
>> sad catfish 0 0 1 ....
>> hawkward carp 0 1 1 ....
>> ....
I tried to use mapply to get this to work but it fails, giving me the error "longer argument not a multiple of length of shorter". It also computes this only for the first string in df2. I haven't proceeded beyond this stage, i.e. attempting to get category flags.
> mapply(grepl, pattern = df1$keyword, x = df2$description)
Could anyone be of help? I thank you very much. I am new to R so it would also help if someone could mention some 'thumb rules' for turning loops into apply functions. I cannot afford to use loops to solve this as it would take way too much time.
There might be a more elegant way to do this but this is what I came up with:
## Your sample data:
df1 <- structure(list(keyword = c("cat", "dog", "pig", "crow", "pigeon", "hawk", "catfish", "carp"),
category = c("A", "A", "A", "B", "B", "B", "C", "C")),
.Names = c("keyword", "category"),
class = "data.frame", row.names = c(NA,-8L))
df2 <- structure(list(description = structure(c(2L, 6L, 5L, 1L, 4L,3L),
.Label = c("dogged dog", "false cat", "hawkward carp", "sad catfish", "shady pigeon", "smiling pig"), class = "factor")),
.Names = "description", row.names = c(NA, -6L), class = "data.frame")
## Load packages:
library(stringr)
library(dplyr)
library(tidyr)
## For each entry in df2$description count how many times each keyword
## is contained in it:
outList <- lapply(df2$description, function(description){
outDf <- data.frame(description = description,
value = vapply(stringr::str_extract_all(description, df1$keyword),
length, numeric(1)), category = df1$category)
})
## Combine to one long data frame and aggregate by category:
outLongDf<- do.call('rbind', outList) %>%
group_by(description, category) %>%
dplyr::summarise(value = sum(value))
## Reshape from long to wide format:
outWideDf <- tidyr::spread(data = outLongDf, key = category,
value = value)
outWideDf
# Source: local data frame [6 x 4]
# Groups: description [6]
#
# description A B C
# * <fctr> <dbl> <dbl> <dbl>
# 1 dogged dog 2 0 0
# 2 false cat 1 0 0
# 3 hawkward carp 0 1 1
# 4 sad catfish 1 0 1
# 5 shady pigeon 1 1 0
# 6 smiling pig 1 0 0
This approach, however, also catches the "pig" in "pigeon" and the "cat" in "catfish". I don't know if this is what you want, though.
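If whole-word matches are what is wanted, one option is to wrap each keyword in word boundaries (\\b). The following is a minimal base R sketch under that assumption, not a drop-in replacement for the code above; the helper name count_whole_word is hypothetical. Note that it gives different counts from the expected table in the question: "dogged dog" counts one dog, and "hawkward" no longer counts as hawk.

```r
keywords   <- c("cat", "dog", "pig", "crow", "pigeon", "hawk", "catfish", "carp")
categories <- c("A", "A", "A", "B", "B", "B", "C", "C")
descriptions <- c("false cat", "smiling pig", "shady pigeon",
                  "dogged dog", "sad catfish", "hawkward carp")

# Hypothetical helper: number of whole-word occurrences of one keyword
# in each element of text (gregexpr returns -1 when there is no match)
count_whole_word <- function(keyword, text) {
  hits <- gregexpr(paste0("\\b", keyword, "\\b"), text, perl = TRUE)
  vapply(hits, function(h) sum(h > 0), integer(1))
}

# descriptions x keywords matrix of counts
counts <- sapply(keywords, count_whole_word, text = descriptions)

# Sum keyword columns within each category: descriptions x categories
result <- t(rowsum(t(counts), categories))
rownames(result) <- descriptions
result
```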
What you are looking for is a so-called document-term matrix (DTM for short), which stems from NLP (Natural Language Processing). There are many options available. I prefer text2vec. This package is blazingly fast (I wouldn't be surprised if it outperformed the other solutions here by a large margin), especially in combination with tokenizers.
In your case the code would look something like this:
# Create the data
df1 <- structure(list(keyword = c("cat", "dog", "pig", "crow", "pigeon", "hawk", "catfish", "carp"),
category = c("A", "A", "A", "B", "B", "B", "C", "C")),
.Names = c("keyword", "category"),
class = "data.frame", row.names = c(NA,-8L))
df2 <- structure(list(description = structure(c(2L, 6L, 5L, 1L, 4L,3L),
.Label = c("dogged dog", "false cat", "hawkward carp", "sad catfish", "shady pigeon", "smiling pig"), class = "factor")),
.Names = "description", row.names = c(NA, -6L), class = "data.frame")
# load the libraries
library(text2vec) # to create the dtm
library(tokenizers) # to help creating the dtm
library(reshape2) # to reshape the data from wide to long
# 1. create the vocabulary from the keywords
vocabulary <- vocab_vectorizer(create_vocabulary(itoken(df1$keyword)))
# 2. create the dtm
dtm <- create_dtm(itoken(as.character(df2$description)), vocabulary)
# 3. convert the sparse-matrix to a data.frame
dtm_df <- as.data.frame(as.matrix(dtm))
dtm_df$description <- df2$description
# 4. melt to long format
df_result <- melt(dtm_df, id.vars = "description", variable.name = "keyword")
df_result <- df_result[df_result$value == 1, ]
# 5. combine the data, i.e., add category
df_final <- merge(df_result, df1, by = "keyword")
# keyword description value category
# 1 carp hawkward carp 1 C
# 2 cat false cat 1 A
# 3 catfish sad catfish 1 C
# 4 dog dogged dog 1 A
# 5 pig smiling pig 1 A
# 6 pigeon shady pigeon 1 B
Whatever the implementation, counting the number of matches per category needs k x d comparisons, where k is the number of keywords and d the number of descriptions.
There are a few tricks to solve this problem fast and without a lot of memory:
Use vectorized operations. These run a lot quicker than for loops. Note that lapply, mapply and vapply are essentially shorthand for for loops. I parallelize (see next) over the keywords so that the vectorization runs over the descriptions, which is the largest dimension.
Use parallelization. Making good use of your multiple cores speeds up the process at the cost of an increase in memory (since every core needs its own copy).
Example:
library(stringr)   # str_detect
library(parallel)  # mclapply
keywords <- stringi::stri_rand_strings(400, 2)
categories <- letters[1:15]
keyword_categories <- sample(categories, 400, TRUE)
descriptions <- stringi::stri_rand_strings(3e6, 20)
keyword_occurance <- function(word, list_of_descriptions) {
description_keywords <- str_detect(list_of_descriptions, word)
}
category_occurance <- function(category, mat) {
rowSums(mat[,keyword_categories == category])
}
list_keywords <- mclapply(keywords, keyword_occurance, descriptions, mc.cores = 8)
df_keywords <- do.call(cbind, list_keywords)
list_categories <- mclapply(categories, category_occurance, df_keywords, mc.cores = 8)
df_categories <- do.call(cbind, list_categories)
With my computer this takes 140 seconds and 14GB RAM to match 400 keywords in 15 categories to 3 million descriptions.

How to create R output that looks like a confusion matrix table

I have two directories. The first is named "model" and the second "test". The lists of files in the two directories correspond to each other, but the contents differ. Both directories contain the same number of files: 37.
Here is an example of the content of one file from each directory.
First file from model directory
Name file : Model_A5B45
data
1 papaya | durian | orange | grapes
2 orange
3 grapes
4 banana | durian
5 tomato
6 apple | tomato
7 apple
8 mangostine
9 strawberry
10 strawberry | mango
dput output :
structure(list(data = structure(c(7L, 6L, 4L, 3L, 10L, 2L, 1L,
5L, 8L, 9L), .Label = c("apple", "apple | tomato", "banana | durian",
"grapes", "mangostine ", "orange", "papaya | durian | orange | grapes",
"strawberry", "strawberry | mango", "tomato"), class = "factor")), .Names = "data", class = "data.frame", row.names = c(NA,
-10L))
Second file in test directory
Name file: Test_A5B45
data
1 apple
2 orange | apple | mango
3 apple
4 banana
5 grapes
6 papaya
7 durian
8 tomato | orange | papaya | durian
dput output:
structure(list(data = structure(c(1L, 5L, 1L, 2L, 4L, 6L, 3L,
7L), .Label = c("apple", "banana", "durian", "grapes", "orange | apple | mango",
"papaya", "tomato | orange | papaya | durian"), class = "factor")), .Names = "data", class = "data.frame", row.names = c(NA,
-8L))
I want to calculate the percentage of intersecting and non-intersecting (except) data between the files in the test directory and the files in the model directory.
This is an example of my code for just two files (Model_A5B45 and Test_A5B45).
library(dplyr)
data_test <- read.csv("Test_A5B45")
data_model <- read.csv("Model_A5B45")
intersect <- semi_join(data_test,data_model)
except <- anti_join(data_test,data_model)
except_percentage <- (nrow(except)/nrow(data_test))*100
intersect_percentage <- (nrow(intersect)/nrow(data_test))*100
sprintf("%s/%s",intersect_percentage,except_percentage)
Output : "37.5/62.5"
My question is: I want to apply my code to all of the files (looping over both directories) so that the output looks like a confusion matrix.
Example of my expected output:
## y
## Model_A5B45 Model_A6B46 Model_A7B47
## Test_A5B45 37.5/62.5 value value
## Test_A6B46 value value value
## Test_A7B47 value value value
My answer:
I've written code that can process all of this, but I still do not know how to make the output look like a confusion matrix.
This is my code (I don't know whether it is efficient; I use a for loop):
f_performance_testing <- function(data_model_path, data_test_path){
library(dplyr)
data_model <- read.csv(data_model_path, header=TRUE)
data_test <- read.csv(data_test_path, header=TRUE)
intersect <- semi_join(data_test,data_model)
except <- anti_join(data_test,data_model)
except_percentage <- (nrow(except)/nrow(data_test))*100
intersect_percentage <- (nrow(intersect)/nrow(data_test))*100
return(list("intersect"=intersect_percentage,"except"=except_percentage))
}
for (model in model_list){
for (test in test_list){
result <- f_performance_testing(model,test)
intersect_percentage <- round(result$intersect,3)
except_percentage <- round(result$except,3)
final_output <- sprintf("intersect : %s | except : %s",intersect_percentage,except_percentage)
cat(print(paste(substring(model,57),substring(test,56), final_output,sep=",")),file="outfile.txt",append=TRUE,"\n")
print("Writing to file.......")
}
}
The output is:
Model_A5B45,Test_A5B45, 37.5/62.5
Model_A5B45,Test_A6B46, value
Model_A5B45,Test_A7B47, value
Model_A6B46,......
Model_A7B47,.....
...............
......
....
How can I transform this output so that it looks like a confusion matrix table?
This won't answer your question directly, but hopefully gives you enough information to arrive at your own solution.
I would recommend creating a function like the following:
myFun <- function(model, test, datasource) {
model <- datasource[[model]]
test <- datasource[[test]]
paste(rev(mapply(function(x, y) (x/y)*100,
lapply(split(test, test %in% model), length),
length(test))),
collapse = "/")
}
This function is to be used with a two-column data.frame, where the columns represent all the combinations of "test" and "model" values (why work with a data.frame structure when a character vector would suffice?)
Here's an example of such a data.frame (other sample data is found at the end of the answer).
models <- c("model_1", "model_2", "model_3")
tests <- c("test_1", "test_2", "test_3")
A <- expand.grid(models, tests, stringsAsFactors = FALSE)
Next, create a named list of your models and tests. If you've read your data in using lapply, it is likely you might have names to work with anyway.
dataList <- mget(c(models, tests))
Now, calculate the relevant values. Here, we can use apply to cycle through each row and perform the relevant calculation.
A$value <- apply(A, 1, function(x) myFun(x[1], x[2], dataList))
Finally, you reshape the data from a "long" form to a "wide" form.
reshape(A, direction = "wide", idvar = "Var1", timevar = "Var2")
# Var1 value.test_1 value.test_2 value.test_3
# 1 model_1 75/25 100 75/25
# 2 model_2 50/50 50/50 62.5/37.5
# 3 model_3 62.5/37.5 50/50 87.5/12.5
Here's some sample data. Note that they are basic character vectors and not data.frames.
set.seed(1)
sets <- c("A", "A|B", "B", "C", "A|B|C", "A|C", "D", "A|D", "B|C", "B|D")
test_1 <- sample(sets, 8, TRUE)
model_1 <- sample(sets, 10, TRUE)
test_2 <- sample(sets, 8, TRUE)
model_2 <- sample(sets, 10, TRUE)
test_3 <- sample(sets, 8, TRUE)
model_3 <- sample(sets, 10, TRUE)
In a real world application, you would probably do something like:
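testFiles <- list.files(path = "path/to/test/files", full.names = TRUE)
modelFiles <- list.files(path = "path/to/model/files", full.names = TRUE)
readData <- function(x) read.csv(x, stringsAsFactors = FALSE)$data
# Name each element after its file so that myFun can look it up by name
testList <- setNames(lapply(testFiles, readData), basename(testFiles))
modelList <- setNames(lapply(modelFiles, readData), basename(modelFiles))
dataList <- c(testList, modelList)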
But, this is pure speculation on my part based on what you've shared in your question as working code (for example, csv files with no file extension).
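As a complementary sketch, the matrix layout asked for in the question can also be filled in directly, with tests as rows and models as columns. This assumes the data has already been read into named lists of character vectors; the names tests, models and overlap are hypothetical, and only the one file pair shown in the question is used as data.

```r
# Hypothetical helper: percentage of test entries found / not found in the model
overlap <- function(test, model) {
  hit <- mean(test %in% model) * 100
  sprintf("%s/%s", hit, 100 - hit)
}

tests <- list(Test_A5B45 = c("apple", "orange | apple | mango", "apple",
                             "banana", "grapes", "papaya", "durian",
                             "tomato | orange | papaya | durian"))
models <- list(Model_A5B45 = c("papaya | durian | orange | grapes", "orange",
                               "grapes", "banana | durian", "tomato",
                               "apple | tomato", "apple", "mangostine",
                               "strawberry", "strawberry | mango"))

# Character matrix: one row per test file, one column per model file
result <- matrix(NA_character_,
                 nrow = length(tests), ncol = length(models),
                 dimnames = list(names(tests), names(models)))
for (i in names(tests)) {
  for (j in names(models)) {
    result[i, j] <- overlap(tests[[i]], models[[j]])
  }
}
result
```

With all 37 files in each list, the same loop would fill a 37 x 37 matrix.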

Replacing values in a data frame column

Given a large data frame with a column that has unique values
(ONE, TWO, THREE, FOUR, FIVE, SIX, SEVEN, EIGHT)
I want to replace some of the values. For example, every occurrence of 'ONE' should be replaced by '1' and
'FOUR' -> '2SQUARED'
'FIVE' -> '5'
'EIGHT' -> '2CUBED'
Other values should remain as they are.
IF/ELSE will run forever. How to apply a vectorized solution? Is match() the correct way to go?
Using @rnso's data set
library(plyr)
transform(data, vals = mapvalues(vals,
c('ONE', 'FOUR', 'FIVE', 'EIGHT'),
c('1','2SQUARED', '5', '2CUBED')))
# vals
# 1 1
# 2 TWO
# 3 THREE
# 4 2SQUARED
# 5 5
# 6 SIX
# 7 SEVEN
# 8 2CUBED
Try the following using base R:
data = structure(list(vals = structure(c(4L, 8L, 7L, 3L, 2L, 6L, 5L,
1L), .Label = c("EIGHT", "FIVE", "FOUR", "ONE", "SEVEN", "SIX",
"THREE", "TWO"), class = "factor")), .Names = "vals", class = "data.frame", row.names = c(NA,
-8L))
initial = c('ONE', 'FOUR', 'FIVE', 'EIGHT')
final = c('1','2SQUARED', '5', '2CUBED')
myfn = function(ddf, init, fin){
refdf = data.frame(init,fin)
ddf$new = refdf[match(ddf$vals, init), 'fin']
ddf$new = as.character(ddf$new)
ndx = which(is.na(ddf$new))
ddf$new[ndx]= as.character(ddf$vals[ndx])
ddf
}
myfn(data, initial, final)
vals new
1 ONE 1
2 TWO TWO
3 THREE THREE
4 FOUR 2SQUARED
5 FIVE 5
6 SIX SIX
7 SEVEN SEVEN
8 EIGHT 2CUBED
>
Your column is probably a factor. Give this a try. Using rnso's data, I'd recommend you first create two vectors of values to change from and values to change to
from <- c("FOUR", "FIVE", "EIGHT")
to <- c("2SQUARED", "5", "2CUBED")
Then replace the factor levels with
levels(data$vals)[match(from, levels(data$vals))] <- to
This gives
data
# vals
# 1 ONE
# 2 TWO
# 3 THREE
# 4 2SQUARED
# 5 5
# 6 SIX
# 7 SEVEN
# 8 2CUBED
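Another vectorized option, sketched here as an assumption rather than taken from the answers above, is a named character vector used as a lookup table. The names vals, lookup and new_vals are hypothetical. Values without an entry come back as NA and are then filled in with the original value; a factor column would need as.character() first.

```r
vals <- c("ONE", "TWO", "THREE", "FOUR", "FIVE", "SIX", "SEVEN", "EIGHT")

# Names are the old values, elements are the replacements
lookup <- c(ONE = "1", FOUR = "2SQUARED", FIVE = "5", EIGHT = "2CUBED")

# Indexing by name is vectorized; unmatched names yield NA
new_vals <- unname(lookup[vals])
new_vals[is.na(new_vals)] <- vals[is.na(new_vals)]
new_vals
```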
