Replace value from dataframe column with value from keyvalue lookup - r

I want to replace certain values in a data frame column with values from a lookup table. I have the values in a list, stuff.kv, and many values are stored in the list (but some may not be).
stuff.kv <- list()
stuff.kv[["one"]] <- "thing"
stuff.kv[["two"]] <- "another"
#etc
I have a dataframe, df, which has multiple columns (say 20), with assorted names. I want to replace the contents of the column named 'stuff' with values from 'lookup'.
I have tried building various apply methods, but nothing has worked.
I built a function that processes a list of items and returns the mutated list:
stuff.lookup <- function(x) {
for( n in 1:length(x) ) {
if( !is.null( stuff.kv[[x[n]]] ) ) x[n] <- stuff.kv[[x[n]]]
}
return( x )
}
unlist(lapply(df$stuff, stuff.lookup))
The apply syntax is bedeviling me.

Since you made such a nice lookup table, you can just use it to change the values. No loops or apply needed.
## Sample Data
set.seed(1234)
DF = data.frame(stuff = sample(c("one", "two"), 8, replace=TRUE))
## Make the change
DF$stuff = unlist(stuff.kv[DF$stuff])
DF
stuff
1 thing
2 another
3 another
4 another
5 another
6 another
7 thing
8 thing

Below is a more general solution building on @G5W's answer, which doesn't cover the case where your original data frame has values that don't exist in the lookup table (that would result in a length mismatch error):
library(dplyr)
stuff.kv <- list(one = "another", two = "thing")
df <- tibble( # data_frame() is deprecated in favour of tibble()
stuff = rep(c("one", "two", "three"), each = 3)
)
df <- df %>%
mutate(stuff = paste(stuff.kv[stuff]))
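One caveat with the paste() approach: as far as I can tell, values missing from the lookup come back as the literal string "NULL". If you would rather keep the original value when there is no match, here is a minimal base R sketch. It assumes the lookup is stored as a named character vector; the name stuff_map is only illustrative.

```r
# Hypothetical lookup stored as a named character vector
stuff_map <- c(one = "thing", two = "another")

df <- data.frame(stuff = rep(c("one", "two", "three"), each = 2),
                 stringsAsFactors = FALSE)

# Indexing by name returns NA for keys that are missing from the lookup
matched <- unname(stuff_map[df$stuff])

# Keep the original value wherever there was no match
df$stuff <- ifelse(is.na(matched), df$stuff, matched)
```

Here "three" is left untouched because it has no entry in stuff_map.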

Related

R: insert rows at specific places in dataframe

I can't seem to find an example to help me solve a particular problem in R. I have a data frame that looks like this:
tmp = data.frame(group = c(rep("A", 5), rep("B",2), rep("C",6)), value = rnorm(13))
In reality I have thousands of columns and rows with many different values for group. The rows in the data frame are ordered by group.
I'd like to insert a new row above the first occurrence of each group. I'd also like for these new rows to only contain a value (the same value) in the first column (although I can make do if columns 2:ncol(tmp) contain NAs). Using the example data frame above, the end result should look like this:
group value
GROUP
A -1.7596279
A -0.8273928
A -0.3515738
A -0.7547999
A 0.5700747
GROUP
B -1.9676482
B 0.3996858
GROUP
C 0.1047832
C 0.5903711
C -1.3687259
C 0.3688415
C 1.3674403
C 0.8880089
Is there a way to do this? I can come up with a list of rows containing the first instance of each group. I was originally thinking that I could use this information to define where new rows should be inserted, but not sure if this is the best way to go.
I tried to create a function that does what you want it to do:
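addEmptyRows <- function(D)
{
  output <- D # work on the argument rather than the global tmp
  i <- 1
  while (i < NROW(output)) {
    if(output$group[i] != output$group[i+1])
    {
      # insert a divider row between two different groups
      output <- rbind(output[1:i,],c("GROUP",NA),output[(i+1):NROW(output),])
      i <- i+1
    }
    i <- i+1
  }
  # add the divider for the first group
  return(rbind(c("GROUP",NA),output))
}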
If you apply this function to your dataframe:
addEmptyRows(tmp)
It gives you the desired dataframe. Does this help you?
You could use something like this:
tmp <- data.frame(group = c(rep("A", 5), rep("B",2), rep("C",6)), value = rnorm(13))
divider <- data.frame(group = "GROUP", value = NA)
do.call(rbind, unlist(lapply(split(tmp, tmp$group),
                             function(x) list(divider, x)), recursive = FALSE))
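The question mentions already knowing the first row of each group. A rough sketch of using those positions directly follows, with fixed values instead of rnorm() so the result is reproducible. The names first, stacked, keys and result are only illustrative.

```r
tmp <- data.frame(group = c(rep("A", 5), rep("B", 2), rep("C", 6)),
                  value = seq_len(13),
                  stringsAsFactors = FALSE)
divider <- data.frame(group = "GROUP", value = NA, stringsAsFactors = FALSE)

# TRUE at the first row of each group
first <- !duplicated(tmp$group)

# One divider per group, stacked on top of the original rows
stacked <- rbind(divider[rep(1, sum(first)), ], tmp)

# Sort keys: each divider sorts just before the first row of its group
keys <- c(which(first) - 0.5, seq_len(nrow(tmp)))
result <- stacked[order(keys), ]
rownames(result) <- NULL
```

This avoids growing the data frame inside a loop, which may matter with thousands of rows.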

Using a loop to select column names from a list

I've been struggling with column selection with lists in R. I've loaded a bunch of csv's (all with different column names and different number of columns) with the goal of extracting all the columns that have the same name (just phone_number, subregion, and phonetype) and putting them together into a single data frame.
I can get the columns I want out of one list element with this:
var<-data[[1]] %>% select("phone_number","Subregion", "PhoneType")
But I cannot select the columns from all the elements in the list this way, just one at a time.
I then tried a for loop that looks like this:
new.function <- function(a) {
for(i in 1:a) {
tst<-datas[[i]] %>% select("phone_number","Subregion", "PhoneType")
}
print(tst)
}
But when I try:
new.function(5)
I'll only get the columns from the 5th element.
I know this might seem like a noob question for most, but I am struggling to learn lists and loops and R. I'm sure I'm missing something very easy to make this work. Thank you for your help.
Another way you could do this is to make a function that extracts your columns and apply it to all data.frames in your list with lapply:
library(dplyr)
extractColumns = function(x){
select(x,"phone_number","Subregion", "PhoneType")
#or x[,c("phone_number","Subregion","PhoneType")]
}
final_df = lapply(data,extractColumns) %>% bind_rows()
The way you have your loop set up currently is only saving the last iteration of the loop because tst is not set up to store more than a single value and is overwritten with each step of the loop.
You can establish tst as a list first with:
tst <- list()
Then, in your code, be explicit that each step is saved as a separate element in the list by adding brackets and an index to tst. Here is a full example, done the way you were doing it.
library(dplyr) # provides %>% and select()
# Example data.frame that could be in datas
df_1 <- data.frame("not_selected" = rep(0, 5),
"phone_number" = rep("1-800", 5),
"Subregion" = rep("earth", 5),
"PhoneType" = rep("flip", 5))
# Another bare data.frame that could be in datas
df_2 <- data.frame("also_not_selected" = rep(0, 5),
"phone_number" = rep("8675309", 5),
"Subregion" = rep("mars", 5),
"PhoneType" = rep("razr", 5))
# Datas is a list of data.frames, we want to pull only specific columns from all of them
datas <- list(df_1, df_2)
# Function for looping through the first 'a' elements
new.function <- function(a) {
  # create list to store the new data.frames in once columns are selected
  tst <- list()
  for(i in seq_len(a)) {
    tst[[i]] <- datas[[i]] %>% select("phone_number","Subregion", "PhoneType")
  }
  tst # return the list instead of only printing it
}
# Proof of concept for 2 elements
new.function(2)

Access variable dataframe in R loop

If I am working with dataframes in a loop, how can I use a variable data frame name (and additionally, variable column names) to access data frame contents?
dfnames <- c("df1","df2")
df1 <- df2 <- data.frame(X = sample(1:10),Y = sample(c("yes", "no"), 10, replace = TRUE))
for (i in seq_along(dfnames)){
curr.dfname <- dfnames[i]
#how can I do this:
curr.dfname$X <- 42:52
#...this
dfnames[i]$X <- 42:52
#or even this doubly variable call
for (j in 1_seq_along(colnames(curr.dfname)){
curr.dfname$[colnames(temp[j])] <- 42:52
}
}
You can use get() to retrieve the value of a variable from a string containing its name:
> x <- 1:10
> get("x")
[1] 1 2 3 4 5 6 7 8 9 10
So, yes, you could iterate through dfnames like:
dfnames <- c("df1","df2")
df1 <- df2 <- data.frame(X = sample(1:10), Y = sample(c("yes", "no"), 10, replace = TRUE))
for (cur.dfname in dfnames)
{
  # note: cur.df is a copy, so df1 and df2 themselves are not changed
  cur.df <- get(cur.dfname)
  # for a fixed column name (10 values, one per row)
  cur.df$X <- 42:51
  # iterating through column names as well
  for (j in colnames(cur.df))
  {
    cur.df[, j] <- 42:51
  }
}
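Because get() hands back a copy, changes made inside the loop do not reach df1 or df2 on their own. If you really do want to stay with names, one possible sketch is to pair get() with assign() to write the result back. Fixed sample data is used here for reproducibility.

```r
dfnames <- c("df1", "df2")
df1 <- df2 <- data.frame(X = 1:10, Y = rep(c("yes", "no"), 5))

for (cur.dfname in dfnames) {
  cur.df <- get(cur.dfname)    # a copy of the data frame's current value
  cur.df$X <- 42:51            # modify the copy (10 values for 10 rows)
  assign(cur.dfname, cur.df)   # store the modified copy under the original name
}
```

After the loop, both df1$X and df2$X hold 42 to 51.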
I really think that this is gonna be a painful approach, though. As the commenters say, if you can get the data frames into a list and then iterate through that, it'll probably perform better and be more readable. Unfortunately, get() isn't vectorised as far as I'm aware, so if you only have a string list of data frame names, you'll have to iterate through that to get a data frame list:
# build data frame list
df.list <- list()
for (i in 1:length(dfnames))
{
df.list[[i]] <- get(dfnames[i])
}
# iterate through data frames
for (cur.df in df.list)
{
cur.df$X <- 42:51
}
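For what it's worth, base R does ship mget(), which looks up several names in one call, so the list can be built without a loop. Also note that the loop variable in a for loop is a copy, so changes made to it are discarded; lapply() returns the modified data frames instead. A small sketch with fixed sample data:

```r
dfnames <- c("df1", "df2")
df1 <- df2 <- data.frame(X = 1:10, Y = rep(c("yes", "no"), 5))

# mget() looks up several names at once and returns a named list
df.list <- mget(dfnames)

# lapply() returns the modified copies instead of throwing them away
df.list <- lapply(df.list, function(d) {
  d$X <- 42:51
  d
})
```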
Hope that helps!
2018 Update: I probably wouldn't do something like this anymore. Instead, I'd put the data frames in a list and then use purrr::map(), or the base equivalent, lapply():
library(tidyverse)
stuff_to_do = function(mydata) {
mydata$somecol = 42:51 # 10 values, one per row
# … anything else I want to do to the current data frame
mydata # return it
}
df_list = list(df1, df2)
map(df_list, stuff_to_do)
This returns a list of modified data frames, although you can use variants of map(), namely map_dfr() and map_dfc(), to automatically bind the list of processed data frames row-wise or column-wise respectively. The former matches columns by name rather than by position, and it can also add an ID column using the .id argument and the names of the input list. So it comes with some nice added functionality over lapply()!
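If you prefer to stay in base R, something similar to the row-wise bind with an ID column can be approximated with Map() and rbind(). This is only a sketch: the list names first and second and the column name source are made up for illustration.

```r
df1 <- df2 <- data.frame(X = 1:10, Y = rep(c("yes", "no"), 5))

stuff_to_do <- function(mydata) {
  mydata$somecol <- 42:51
  mydata
}

df_list <- list(first = df1, second = df2)   # names become the ID values
processed <- lapply(df_list, stuff_to_do)

# Prepend an ID column to each data frame, then stack them row-wise
with_id <- Map(function(d, id) cbind(source = id, d),
               processed, names(processed))
combined <- do.call(rbind, with_id)
rownames(combined) <- NULL
```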

For loop to plyr function

I have a character array that holds the column names and values for a row in a data frame. Unfortunately, if the value of a specific entry is zero, the column name and value are not listed in the array. I create my desired data frame with this information, but I rely on a "for loop".
I want to utilize plyr to avoid the for loop in the working code below.
types <- c("one", "two", "three") # My data
entry <- c("one(1)", "three(2)") # My data
values <- function(entry, types)
{
frame<- setNames(as.data.frame(matrix(0, ncol = length(types), nrow = 1)), types)
for(s1 in 1:length(entry))
{
name <- gsub("\\(\\w*\\)", "", entry[s1]) # get name
quantity <- as.numeric(unlist(strsplit(entry[s1], "[()]"))[2]) # get value
frame[1, which(colnames(frame)==name)] <- quantity # store
}
return(frame)
}
values(entry, types) # This is how I want the output to look
I have tried the following to split the array, but I can't figure out how to get adply to return a single row.
types <- c("one", "two", "three") # data
entry <- c("one(1)", "three(2)") # data
frame<- setNames(as.data.frame(matrix(0, ncol = length(types), nrow = 1)), types)
array_split <- function(entry, frame){
name <- gsub("\\(\\w*\\)", "", entry) # get name
quantity <- as.numeric(unlist(strsplit(entry, "[()]"))[2]) # get value
frame[1, which(colnames(frame)==name)] <- quantity # store
return(frame)
}
adply(entry, 1, array_split, frame)
Is there something like cumsum I should be considering? I want to complete the operation quickly.
I'm not sure why you aren't just doing something more like this:
frame <- setNames(rep(0,length(types)),types)
a <- as.numeric(sapply(strsplit(entry,"[()]"),`[[`,2))
names(a) <- gsub("\\(\\w*\\)", "", entry)
frame[names(a)] <- a
Both gsub and strsplit are already vectorized, so there's no real need for an explicit loop anywhere. You only need the sapply to extract the second element of each strsplit result. The rest is just regular indexing.
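The code above leaves frame as a named numeric vector. If a single-row data frame is needed, as in the question, one way to get there is sketched below; one_row is just an illustrative name.

```r
types <- c("one", "two", "three")
entry <- c("one(1)", "three(2)")

frame <- setNames(rep(0, length(types)), types)
a <- as.numeric(sapply(strsplit(entry, "[()]"), `[[`, 2))
names(a) <- gsub("\\(\\w*\\)", "", entry)
frame[names(a)] <- a

# Turn the named vector into a single-row data frame
one_row <- as.data.frame(as.list(frame))
```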

R loops: Adding a column to a table if does not already exist

I am trying to compile data from several files using for loops in R. I would like to get all the data into one table. The following calculation is just an example.
library(reshape)
dat1 <- data.frame("Specimen" = paste("sp", 1:10, sep=""), "Density_1" = rnorm(10,4,2), "Density_2" = rnorm(10,4,2), "Density_3" = rnorm(10,4,2))
dat2 <- data.frame("Specimen" = paste("fg", 1:10, sep=""), "Density_1" = rnorm(10,4,2), "Density_2" = rnorm(10,4,2))
dat <- c("dat1", "dat2")
for(i in 1:length(dat)){
data <- get(dat[i])
melt.data <- melt(data, id = 1)
assign(paste(dat[i], "tbl", sep=""), cast(melt.data, ~ variable, mean))
}
rbind(dat1tbl, dat2tbl)
What is the smoothest way to add an extra column into dat2? I would like to get the same column name ("Density_3" in this case) and fill it up with zeros, if it does not already exist. Assume that I have ~100 tables with number of columns (Density_1, 2, 3 etc) varying between 5 and 6.
I tried the following, but it didn't work:
if(names(data) %in% "Density_3" == FALSE){
dat.all$Density_3 <- 0
} else {
dat.all$Density_3 <- dat.all$Density3}
Another one: is there a smooth way to rbind() the tables? It seems that rbind(get(dat)) does not work.
After staring at this question for a while, I think its intent may have been obscured by the unnecessary get and assign manipulations. And I think the answer is plyr::rbind.fill
I would have constructed "dat" not as a character vector but as a list of two data frames, used aggregate(..., FUN=mean) (because I haven't gotten on the reshape2/plyr bus, except for melt and rbind.fill, that is), and then called do.call(rbind.fill, ...) on the resulting list. At any rate, this is what I think you want. I do not think it is a good idea to add in zeros for what are really missing values.
> rbind.fill(dat1tbl, dat2tbl)
value Density_1 Density_2 Density_3
1 (all) 5.006709 4.088988 2.958971
2 (all) 4.178586 3.812362 NA
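If you would rather not depend on plyr, the same idea can be sketched in base R by adding the missing columns before rbind(). The tables tbl1 and tbl2 below are made-up stand-ins for dat1tbl and dat2tbl, and fill_missing is a hypothetical helper.

```r
# Hypothetical stand-ins for dat1tbl and dat2tbl
tbl1 <- data.frame(Density_1 = 5.0, Density_2 = 4.1, Density_3 = 3.0)
tbl2 <- data.frame(Density_1 = 4.2, Density_2 = 3.8)
tables <- list(tbl1, tbl2)

# Every column name that appears in any table
all_cols <- unique(unlist(lapply(tables, names)))

fill_missing <- function(d) {
  # Add each absent column as NA (use 0 here if zeros are really wanted)
  for (col in setdiff(all_cols, names(d))) d[[col]] <- NA
  d[all_cols]   # same column order for every table
}

combined <- do.call(rbind, lapply(tables, fill_missing))
```

With around 100 tables, the same lapply() and do.call() pair works unchanged once the tables are in a list.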
