How to recode a set of variables in a dataframe in R

I have a dataframe with different variables containing values from 1 to 5. I want to recode some variables in the way that 5 becomes 1 and vice versa (x=6-x).
I want to define a list of variables, that will be recoded like this in my dataframe.
Here is my approach using lapply. I haven't really understood it yet.
#generate example-dataset
var1<-sample(1:5,100,rep=TRUE)
var2<-sample(1:5,100,rep=TRUE)
var3<-sample(1:5,100,rep=TRUE)
dat<-as.data.frame(cbind(var1,var2,var3))
recode.list<-c("var1","var3")
recode.function<- function(x){
x=6-x
}
lapply(recode.list,recode.function,data=dat)

There's no need for an external function or for a package for this. Just use an anonymous function in lapply, like this:
df[recode.list] <- lapply(df[recode.list], function(x) 6-x)
Here df stands for your data frame (dat in the question). Indexing with [] on the left-hand side replaces just those columns in the original data frame. This is needed because lapply on its own returns a named list rather than a data frame.
As noted in the comments, you can actually even skip lapply:
df[recode.list] <- 6 - df[recode.list]
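A minimal, self-contained sketch of both variants on the example data from the question. The set.seed() call and the copies orig, a and b are assumptions added here only so the results can be compared; they are not part of the original answer.

```r
# Example data as in the question; set.seed() is added for reproducibility
set.seed(1)
dat <- data.frame(
  var1 = sample(1:5, 100, replace = TRUE),
  var2 = sample(1:5, 100, replace = TRUE),
  var3 = sample(1:5, 100, replace = TRUE)
)
recode.list <- c("var1", "var3")
orig <- dat  # untouched copy, kept only for comparison

# Variant 1: lapply over the selected columns, assign back into those columns
a <- dat
a[recode.list] <- lapply(a[recode.list], function(x) 6 - x)

# Variant 2: arithmetic directly on the selected columns
b <- dat
b[recode.list] <- 6 - b[recode.list]

isTRUE(all.equal(a, b))         # do both variants agree?
all(a$var1 == 6 - orig$var1)    # was var1 reversed?
identical(a$var2, orig$var2)    # was var2 left alone?
```

The lapply call in the question fails for a different reason: it iterates over the column names (character strings) rather than over the columns themselves, and its result is never assigned back to the data frame.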

You can use mapvalues from plyr.
require(plyr)
# if you just want to replace 5 with 1 and vice versa
df[, recode.list] <- sapply(df[, recode.list], mapvalues, c(1, 5), c(5,1))
# if you want to apply x = 6 - x to all values (in this case you don't really need mapvalues)
df[, recode.list] <- sapply(df[, recode.list], mapvalues, 1:5, 5:1)

Here's an option to do this with dplyr:
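recode.function <- function(x) {
  6 - x
}
recode.list <- c("var1", "var3")
library(dplyr)
# mutate_each_() and funs() are deprecated in current dplyr; across() (dplyr >= 1.0.0) replaces them
df %>% mutate(across(all_of(recode.list), recode.function))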
# var1 var2 var3
#1 2 2 4
#2 3 3 3
#3 3 5 2
#4 3 3 2
#5 4 3 3
#6 5 4 1
#...

Related

Stacking dataframes from output of iterated function in R

I have written a function and am trying to iterate so that each value from a list becomes the input to the function once.
repeat100 <- function(B) {
se100 <- replicate(100, bootandse(B))
bandse <- data.frame("B" = B, "SE" = se100)
}
The output of this function is a dataframe with one column for the value of B that was inputted and another column for the SE. Like this, but with 100 rows.
B SE
1 2
1 4
1 3
I have an object B with several values. I am trying to iterate repeat100() over each value of B.
B <- c(1, 4, 6, 20, 30)
How can I get the resulting output to be a dataframe that has the output of each repeat100() for each value of B stacked like this?
B SE
1 2
1 3
1 2
4 5
4 3
4 2
Use lapply to build a list of data frames, then do.call with rbind to combine them into one dataframe.
result <- do.call(rbind, lapply(B, repeat100))
Or purrr::map_df which is shorter.
purrr::map_df(B, repeat100)
We can use rbindlist from data.table
library(data.table)
rbindlist(lapply(B, repeat100))
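A runnable sketch of the base R variant. The question does not show bootandse(), so the version below is a hypothetical stand-in that only returns a random number; everything else follows the question.

```r
# Hypothetical stand-in for bootandse(), which the question does not show;
# it returns one random number so that the example runs
bootandse <- function(B) runif(1)

repeat100 <- function(B) {
  se100 <- replicate(100, bootandse(B))
  data.frame(B = B, SE = se100)   # the data frame is the function's return value
}

B <- c(1, 4, 6, 20, 30)

# lapply() gives a list of five data frames; do.call(rbind, ...) stacks them
result <- do.call(rbind, lapply(B, repeat100))

dim(result)        # 500 rows (5 values of B x 100 replicates), 2 columns
table(result$B)    # each value of B appears 100 times
```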

data.table ifelse with multiple columns

I have a dataset with tens of columns that looks something like this:
df <- data.frame(id= c(1,1,1,2,2,2,3,3,3), time=c(1,2,3,1,2,3,1,2,3),y1 = rnorm(9), y2= rnorm(9), x = rnorm(9), xb = rnorm(9))
df
# id time y1 y2 x xb
# 1 1 1 -1.1184009 -1.07430118 0.61398523 -0.68343624
# 2 1 2 0.4347047 -0.53454071 -0.30716538 -1.02328242
# 3 1 3 0.2318315 -0.05854228 0.05169733 -0.22130149
# 4 2 1 1.2640080 2.07899296 -0.95918953 -0.35961156
# 5 2 2 -0.4374764 -0.25284854 -0.46251901 0.08630344
# 6 2 3 0.5042690 0.13322671 1.00881113 0.43807458
# 7 3 1 0.3672216 1.92995242 0.48708183 0.58206127
# 8 3 2 -1.5431709 0.53362731 1.17361087 -1.00932195
# 9 3 3 -1.4577268 0.23413541 -0.32399489 -0.91040641
I would like to modify my data frame using the following logic
df<-setDT(df)[,y1:=ifelse(y1>x,x,y1)]
df<-setDT(df)[,y2:=ifelse(y2>xb,xb,y2)]
However, since I have many variables I would like to do this in a single line expression. In other words, I would like to pass this function for multiple columns at once i.e. y1 with x, y2 with xb and so on...
I have tried the following but it does not seem to work
mod<-c("y1","y2")
max<-c("x","xb")
df2<-setDT(ppta)[,(mod):=ifelse(.(mod)>.(max),.(max),.(mod))]
Does anyone know what I am doing wrong, and how I can modify multiple columns with their respective partner columns at once?
Consider using pmin instead of your ifelse. You can try:
mod<-c("y1","y2")
max<-c("x","xb")
setDT(df)
df[,c(mod):=Map(pmin,mget(mod),mget(max))]
Explanation:
pmin takes two (or more) vectors and gives the minimum value for each element (equivalent of your ifelse(y1>x,x,y1));
mget returns a list of objects from their names. For instance mget(c("a","b")) returns a list with the a and b objects (if they exist). This is used to retrieve the columns from their names in the environment of the data table;
Map applies a function of several arguments element by element across its inputs. Map(f,a,b) is equivalent to list(f(a[[1]],b[[1]]),f(a[[2]],b[[2]]),...).
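The pairing done by Map and pmin can be checked without data.table. The following is a base R sketch of the same idea, not the data.table code from the answer; the name cap is an assumption used here instead of max to avoid masking the base function, and set.seed() is added for reproducibility.

```r
set.seed(42)
df <- data.frame(y1 = rnorm(9), y2 = rnorm(9), x = rnorm(9), xb = rnorm(9))

mod <- c("y1", "y2")   # columns to cap
cap <- c("x", "xb")    # their partner columns (called max in the answer)

# Reference result using the original ifelse logic, one column at a time
expected <- df
expected$y1 <- ifelse(df$y1 > df$x,  df$x,  df$y1)
expected$y2 <- ifelse(df$y2 > df$xb, df$xb, df$y2)

# Map() pairs y1 with x and y2 with xb; pmin() takes the element-wise minimum
df[mod] <- Map(pmin, df[mod], df[cap])

isTRUE(all.equal(df, expected))   # does Map + pmin match the ifelse version?
```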

between list calculations per row in R

Let's say I have the following list of data frames (in reality I have many more).
seq <- c("12345","67890")
li <- list()
for (i in 1:length(seq)){
li[[i]] <- list()
names(li)[i] <- seq[i]
li[[i]] <- data.frame(A = c(1,2,3),
B = c(2,4,6))
}
What I would like to do is calculate the mean of each cell position across the data frames in the list, keeping the same number of rows and columns as the originals. How could I do this? I believe I can use the apply() function, but I am unsure how.
The expected output (not surprising):
A B
1 1 2
2 2 4
3 3 6
In reality, the values within each list are not necessarily the same.
If there are no NAs, then we can Reduce to get the sum of observations for each element and divide by the length of the list
Reduce(`+`, li)/length(li)
# A B
#1 1 2
#2 2 4
#3 3 6
If there are NA values, then it may be better to use mean (which has an na.rm argument). For this, we can convert the list to an array and then use apply, passing na.rm = TRUE on to mean
apply(array(unlist(li), dim = c(dim(li[[1]]), length(li))), c(1, 2), mean, na.rm = TRUE)
An equivalent option in tidyverse would be
library(tidyverse)
reduce(li, `+`)/length(li)
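To see how the two approaches differ when a value is missing, here is a small sketch with made-up data (the two data frames below are illustrative and not taken from the question). The array version passes na.rm = TRUE to mean.

```r
li <- list(
  data.frame(A = c(1, 2, 3),  B = c(2, 4, 6)),
  data.frame(A = c(3, NA, 5), B = c(4, 4, 10))
)

# Reduce: an NA in any data frame makes the mean of that cell NA
by_sum <- Reduce(`+`, li) / length(li)

# Array + apply: stack the data frames along a third dimension and
# average over it, dropping NAs cell by cell
arr <- array(unlist(li), dim = c(dim(li[[1]]), length(li)))
by_mean <- apply(arr, c(1, 2), mean, na.rm = TRUE)
colnames(by_mean) <- names(li[[1]])

by_sum    # cell [2, "A"] is NA
by_mean   # cell [2, "A"] is 2, the only non-missing value
```

Note that apply returns a matrix here, so wrap it in as.data.frame() if a data frame is needed.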

looping over each column to impute data in R but does not replace imputed data

I am trying to impute the dataframe with Hmisc impute model. I am able to impute the data for one column at a time but fail to loop over columns.
Below example - works fine but I would like to make it dynamic using a function:
impute_marks$col1 <- with(impute_marks, round(impute(col1, mean)),0)
Example:
impute_dataframe <- function()
{
for(i in 1:ncol(impute_marks))
{
impute_marks[is.na(impute_marks[,i]), i] <- with(impute_marks, round(impute(impute_marks[,i], mean)),0)
}
}
impute_dataframe
There is no error when I run the function but there is no imputed data as well to the dataset impute_marks.
Hmisc::impute is already a function, so why not just use apply and save a for loop?
library(Hmisc)
age1 <- c(1,2,NA,4)
age2 <- c(NA, 4, 3, 1)
mydf <- data.frame(age1, age2)
mydf
age1 age2
1 1 NA
2 2 4
3 NA 3
4 4 1
apply(mydf, 2, function(x) {round(impute(x, mean))})
age1 age2
1 1 3
2 2 4
3 2 3
4 4 1
EDIT: apply returns a matrix, so to get a data.frame back you could coerce the result like this:
mydf <- as.data.frame(apply(mydf, 2, function(x) round(impute(x, mean))))
But what I'd do is use another package, purrr, which is a nice set of tools built around this apply/mapping idea. map_df, for example, will always return a data.frame object; there are a bunch of map_x variants that you can see with ?map
library(purrr)
map_df(mydf, ~ round(impute(., mean)))
I know it is preferred to use the base R functions, but purrr makes apply style operations so much easier.
We can use na.aggregate from zoo which can be applied directly on the dataset
library(zoo)
round(na.aggregate(mydf))
# age1 age2
#1 1 3
#2 2 4
#3 2 3
#4 4 1
or in each column separately with lapply
mydf[] <- lapply(mydf, function(x) round(na.aggregate(x)))
By default, na.aggregate fills in the mean, but this can be changed through its FUN argument (for example FUN = median).
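For comparison, here is a base R sketch of mean imputation over all columns that needs neither Hmisc nor zoo. The helper name impute_means is hypothetical. It also illustrates two likely reasons the function in the question appeared to do nothing: it is never called (impute_dataframe is written without parentheses), and it only changes a local copy of impute_marks without returning it.

```r
# Hypothetical helper: mean-impute every numeric column and RETURN the result.
# A function that only modifies its local copy leaves the caller's data frame
# unchanged. Unlike the answers above, only the imputed values are rounded.
impute_means <- function(data) {
  data[] <- lapply(data, function(x) {
    if (is.numeric(x)) {
      x[is.na(x)] <- round(mean(x, na.rm = TRUE))
    }
    x
  })
  data
}

mydf <- data.frame(age1 = c(1, 2, NA, 4), age2 = c(NA, 4, 3, 1))

# Call the function (with parentheses) and assign the result back
mydf <- impute_means(mydf)
mydf
```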

Flatten list column in data frame with ID column

My data frame contains the output of a survey with a select multiple question type. Some cells have multiple values.
df <- data.frame(a=1:3,b=I(list(1,1:2,1:3)))
df
a b
1 1 1
2 2 1, 2
3 3 1, 2, 3
I would like to flatten out the list to obtain the following output:
df
a b
1 1 1
2 2 1
3 2 2
4 3 1
5 3 2
6 3 3
This should be easy, but somehow I can't find the right search terms. Thanks.
You can just use unnest from "tidyr":
library(tidyr)
unnest(df, b)
# a b
# 1 1 1
# 2 2 1
# 3 2 2
# 4 3 1
# 5 3 2
# 6 3 3
Using base R, one option is stack after naming the list elements of 'b' column with that of the elements of 'a'. We can use setNames to change the names.
stack(setNames(df$b, df$a))
Or another option would be to use unstack to automatically name the list element of 'b' with 'a' elements and then do the stack to get a data.frame output.
stack(unstack(df, b~a))
Or we can use a convenient function listCol_l from splitstackshape to convert the list to data.frame.
library(splitstackshape)
listCol_l(df, 'b')
Here's one way, with data.table:
require(data.table)
data.table(df)[,as.integer(unlist(b)),by=a]
If b is stored consistently, as.integer can be skipped. You can check with
unique(sapply(df$b,class))
# [1] "numeric" "integer"
Here's another base solution, far less elegant than any other solution posted thus far. Posting for the sake of completeness, though personally I would recommend akrun's base solution.
with(df, cbind(a = rep(a, sapply(b, length)), b = do.call(c, b)))
This constructs the first column as the elements of a, where each is repeated to match the length of the corresponding list item from b. The second column is b "flattened" using do.call() with c().
As Ananda Mahto pointed out in a comment, sapply(b, length) can be replaced with lengths(b) in R 3.2.0 and later.
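Put together, the rep and lengths idea can be sketched as follows. This variant builds a data.frame rather than the matrix that cbind returns; the name flat and the as.numeric() call (used to get a plain numeric column) are choices made for this illustration.

```r
df <- data.frame(a = 1:3, b = I(list(1, 1:2, 1:3)))

# Repeat each id once per element of its list entry, then flatten the list
flat <- data.frame(
  a = rep(df$a, lengths(df$b)),
  b = as.numeric(unlist(df$b))
)
flat
```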
A base R approach might also be to create a new data.frame for each row and rbind it afterwards:
df <- data.frame(a=1:3,b=I(list(1,1:2,1:3)))
df
df <- lapply(seq_along(df$a), function(x){data.frame(a = df$a[[x]], b = df$b[[x]])})
df <- do.call("rbind", df)
df
