R wildcards, sapply and as.factor - r

I want to change the type to factor of all variables in a data frame whose names match a certain pattern.
So here I am trying to change the type to factor of all variables whose name begins with namestub in the dataframe df.
attach(df)
sapply(grep(glob2rx("namestub*"), names(df)), as.factor)
But this doesn't work since
> levels(df$namestub1)
NULL

## Make a reproducible example
df <- data.frame(namestubA = letters[1:5], B = letters[5:1],
                 namestubC = LETTERS[1:5], stringsAsFactors = FALSE)
## Get indices of columns to convert
ii <- grep(glob2rx("namestub*"), names(df))
## Convert and replace the indicated columns
df[ii] <- lapply(df[ii], as.factor)
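A likely reason the attempt in the question has no effect: grep() returns column positions, so sapply() applies as.factor to those integers rather than to the columns, and the result is never assigned back into df. The sketch below rebuilds the example data so it runs on its own, and uses startsWith() as an alternative to the glob pattern (that substitution is an illustration, not part of the original answer):

```r
df <- data.frame(namestubA = letters[1:5], B = letters[5:1],
                 namestubC = LETTERS[1:5], stringsAsFactors = FALSE)

# Original attempt: as.factor is applied to the positions 1 and 3,
# and df itself is left untouched
attempt <- sapply(grep(glob2rx("namestub*"), names(df)), as.factor)

# Select columns by prefix, convert them, and assign the result back into df
ii <- startsWith(names(df), "namestub")
df[ii] <- lapply(df[ii], as.factor)

# Inspect the class of every column after the conversion
sapply(df, class)
```

Note that attach(df) is not needed for either approach.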

Related

How can lapply work with addressing columns as unknown variables?

So, I have a list of strings named control_for. I have a data frame sampleTable with some of the columns named as strings from control_for list. And I have a third object dge_obj (DGElist object) where I want to append those columns. What I wanted to do - use lapply to loop through control_for list, and for each string, find a column in sampleTable with the same name, and then add that column (as a factor) to a DGElist object. For example, for doing it manually with just one string, it looks like this, and it works:
group <- as.factor(sampleTable[,3])
dge_obj$samples$group <- group
And I tried something like this:
lapply(control_for, function(x) {
x <- as.factor(sampleTable[, x])
dge_obj$samples$x <- x
})
Which doesn't work. I guess the problem is that R can't recognize addressing columns like this. Can someone help?
Here are two base R ways of doing it. The data set is the example of help("DGEList") and a mock up data.frame sampleTable.
Define a vector common_vars of the table's names in control_for. Then create the new columns.
library(edgeR)
sampleTable <- data.frame(a = 1:4, b = 5:8, no = letters[21:24])
control_for <- c("a", "b")
common_vars <- intersect(control_for, names(sampleTable))
1. for loop
for(x in common_vars){
y <- sampleTable[[x]]
dge_obj$samples[[x]] <- factor(y)
}
2. *apply loop.
# lapply keeps each column a factor; sapply would simplify the result to a character matrix
tmp <- lapply(sampleTable[common_vars], factor)
dge_obj$samples <- cbind(dge_obj$samples, tmp)
This code can be rewritten as a one-liner.
Data
set.seed(2021)
y <- matrix(rnbinom(10000,mu=5,size=2),ncol=4)
dge_obj <- DGEList(counts=y, group=rep(1:2,each=2))
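Two things appear to defeat the lapply attempt in the question: `$x` always refers to a column literally named "x", and an assignment made inside the function only changes a local copy of the object. A minimal sketch of the difference, using a plain list as a hypothetical stand-in for the DGEList so that it runs without edgeR:

```r
# Hypothetical stand-in for the DGEList: a plain list holding a samples data frame
obj <- list(samples = data.frame(group = factor(c(1, 1, 2, 2))))
sampleTable <- data.frame(a = 1:4, b = 5:8, no = letters[21:24])
control_for <- c("a", "b")

# `[[x]]` uses the value stored in x as the column name; `$x` would not
for (x in control_for) {
  obj$samples[[x]] <- factor(sampleTable[[x]])
}

# With lapply, build the factor columns first and assign once, outside the function
obj2 <- list(samples = data.frame(group = factor(c(1, 1, 2, 2))))
obj2$samples[control_for] <- lapply(sampleTable[control_for], factor)

str(obj$samples)
```

Both variants should leave the samples data frame with the extra factor columns a and b.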

How do you replace an entire column in one dataframe with another column in another dataframe?

I have two dataframes. I want to replace the ids in dataframe1 with generic ids. In dataframe2 I have mapped the ids from dataframe1 with the generic ids.
Do I have to merge the two dataframes and after it is merged do I delete the column I don't want?
Thanks.
With dplyr
library(dplyr)
left_join(df1, df2, by = 'ids')
We can use merge and then delete the ids.
dataframe1 <- data.frame(ids = 1001:1010, variable = runif(min=100,max = 500,n=10))
dataframe2 <- data.frame(ids = 1001:1010, generics = 1:10)
result <- merge(dataframe1,dataframe2,by="ids")[,-1]
Alternatively we can use match and replace by assignment.
dataframe1$ids <- dataframe2$generics[match(dataframe1$ids,dataframe2$ids)]
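A small worked example may make the match() lookup easier to follow. The data below are made up for illustration (repeated and unordered ids), not taken from the question:

```r
# Made-up data: ids in dataframe1 are unordered and one id repeats
dataframe1 <- data.frame(ids = c(1003, 1001, 1002, 1001),
                         variable = c(10, 20, 30, 40))
dataframe2 <- data.frame(ids = 1001:1003, generics = 1:3)

# For each id in dataframe1, match() returns its row position in dataframe2
pos <- match(dataframe1$ids, dataframe2$ids)

# Use those positions to look up the generic id, keeping the row order of dataframe1
dataframe1$ids <- dataframe2$generics[pos]
dataframe1$ids
```

Unlike merge(), which sorts on the key by default, this keeps the rows of dataframe1 in their original order. Any id without an entry in dataframe2 would become NA.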
You didn't provide much code, so here is a general illustration of replacing a column in one data frame with a column from another:
#create 4 random columns (vectors) of data, and merge them into data frames:
a <- rnorm(n=100,mean = 0,sd=1)
b <- rnorm(n=100,mean = 0,sd=1)
c <- rnorm(n=100,mean = 0,sd=1)
d <- rnorm(n=100,mean = 0,sd=1)
df_ab <- as.data.frame(cbind(a,b))
df_cd <- as.data.frame(cbind(c,d))
#if you want column d in df_cd to equal column a in df_ab simply use the assignment operator
df_cd$d <- df_ab$a
#you can also use the subsetting with square brackets:
df_cd[,"d"] <- df_ab[,"a"]

Add different suffix to column names on multiple data frames in R

I'm trying to add different suffixes to my data frames so that I can distinguish them after I've merge them. I have my data frames in a list and created a vector for the suffixes but so far I have not been successful.
data2016 is the list containing my 7 data frames
new_names <- c("june2016", "july2016", "aug2016", "sep2016", "oct2016", "nov2016", "dec2016")
data2016v2 <- lapply(data2016, paste(colnames(data2016)), new_names)
Your question is not entirely clear, so here are two solutions.
The beginning is the same for either solution. Suppose you have these four dataframes:
df1x <- data.frame(v1 = rnorm(50),
                   v2 = runif(50))
df2x <- data.frame(v3 = rnorm(60),
                   v4 = runif(60))
df3x <- data.frame(v1 = rnorm(50),
                   v2 = runif(50))
df4x <- data.frame(v3 = rnorm(60),
                   v4 = runif(60))
Suppose further you assemble them in a list, something akin to your data2016, using mget and ls with a pattern to match them:
my_list <- mget(ls(pattern = "^df\\d+x$"))
The names of the dataframes in this list are the following:
names(my_list)
[1] "df1x" "df2x" "df3x" "df4x"
Solution 1:
Suppose you want to change the names of the dataframes thus:
new_names <- c("june2016", "july2016","aug2016", "sep2016")
Then you can simply assign new_names to names(my_list):
names(my_list) <- new_names
And the result is:
names(my_list)
[1] "june2016" "july2016" "aug2016" "sep2016"
Solution 2:
You want to add the new_names literally as suffixes to the 'old' names, in which case you would use paste or paste0 thus:
names(my_list) <- paste0(names(my_list), "_", new_names)
And the result is:
names(my_list)
[1] "df1x_june2016" "df2x_july2016" "df3x_aug2016" "df4x_sep2016"
You could use an index number within lapply to reference both the list and your vector of suffixes. Because there are a couple steps, I'll wrap the process in a function(). (Called an anonymous function because we aren't assigning a name to it.)
data2016v2 <- lapply(1:7, function(i) {
  this_data <- data2016[[i]] # Double brackets for a list
  names(this_data) <- paste0(names(this_data), new_names[i]) # Single bracket for vector
  this_data # The renamed data frame to be placed into data2016v2
})
Notice in the paste0() line we are recycling the term in new_names[i], so for example if new_names[i] is "june2016" and your first data.frame has columns "A", "B", and "C" then it would give you this:
> paste0(c("A", "B", "C"), "june2016")
[1] "Ajune2016" "Bjune2016" "Cjune2016"
(You may want to add an underscore in there?)
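As an alternative to indexing, Map() can walk the list and the suffix vector in parallel. This is a sketch with the underscore added; data2016 here is a hypothetical two-element stand-in for the seven data frames in the question:

```r
# Hypothetical stand-ins for data2016 and new_names (two data frames instead of seven)
data2016 <- list(data.frame(A = 1:2, B = 3:4),
                 data.frame(A = 5:6, C = 7:8))
new_names <- c("june2016", "july2016")

# Map() passes the i-th data frame and the i-th suffix to the function together
data2016v2 <- Map(function(d, suffix) {
  names(d) <- paste0(names(d), "_", suffix)
  d
}, data2016, new_names)

lapply(data2016v2, names)
```

This avoids hard-coding the list length, at the cost of requiring the list and the suffix vector to have the same length.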
As an aside, it sounds like you might be better served by adding the "june2016" as a column in your data (like say a variable named month with "june2016" as the value in each row) and combining your data using something like bind_rows() from the dplyr package, running it "long" instead of "wide".
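The "long" layout could look roughly like the following. This sketch uses base R (Map, cbind and rbind) as a stand-in for bind_rows(), and assumes every data frame has the same columns; the data and the month column name are made up for illustration:

```r
# Hypothetical stand-ins: two monthly data frames with identical columns
data2016 <- list(data.frame(A = 1:2, B = 3:4),
                 data.frame(A = 5:6, B = 7:8))
new_names <- c("june2016", "july2016")

# Tag every row with its month instead of renaming the columns
tagged <- Map(function(d, m) cbind(d, month = m), data2016, new_names)

# Stack the tagged data frames on top of each other
long <- do.call(rbind, tagged)
rownames(long) <- NULL
long
```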

Rename all other levels to "Other"

I have a dataframe containing all the calls that I have made in the last year. Under the column "Name" there are the names of the people in my contact list. In R this column is a factor with 30 levels; I want to have only 4 levels: Mom, Dad, BestFriend and Other.
I'm using this snippet:
library(plyr)
call$Name <- mapvalues(call$Name, from = 'Mikey Mouse', to = 'BFF')
call$Name <- mapvalues(call$Name, from = c('Rocky Balboa','Uma Thurman'), to = c('Dad','Mom'))
How can I rename all other levels aside those 3 to Other?
We can first create a level 'Other' (assuming the column is a factor), then assign 'Other' to every level that is not %in% the vector of levels to keep ('nm1')
levels(call$Name) <- c(levels(call$Name), 'Other')
levels(call$Name)[!levels(call$Name) %in% nm1] <- 'Other'
Another option is recode from dplyr, which also has the .default argument to map all levels that are not listed to a given value
library(dplyr)
recode(call$Name, `Mikey Mouse` = 'BFF', `Rocky Balboa` = 'Dad',
`Uma Thurman` = 'Mom', .default = 'Other')
data
set.seed(24)
call <- data.frame(Name = sample(c('Mikey Mouse', 'Rocky Balboa',
    'Uma Thurman', 'Richard Gere', 'Rick Perry'), 25, replace = TRUE),
    stringsAsFactors = TRUE)  # needed in R >= 4.0 for Name to be a factor
nm1 <- c('Mikey Mouse', 'Rocky Balboa', 'Uma Thurman')
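To see what the levels<- assignment does, here is a small self-contained run on a made-up six-element factor (the variable names Name, keep and collapsed are illustrative). Assigning the same label to several levels merges them into one:

```r
# Made-up factor with five distinct names
Name <- factor(c("Mikey Mouse", "Rocky Balboa", "Uma Thurman",
                 "Richard Gere", "Rick Perry", "Uma Thurman"))
keep <- c("Mikey Mouse", "Rocky Balboa", "Uma Thurman")

# Every level not in `keep` is relabelled "Other"; duplicate labels are merged
collapsed <- Name
levels(collapsed)[!levels(collapsed) %in% keep] <- "Other"

levels(collapsed)
# [1] "Mikey Mouse"  "Other"        "Rocky Balboa" "Uma Thurman"
```

The kept levels can then be renamed in the same way, for example with mapvalues() as in the question.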
There is also the fct_other() function in the forcats package for doing exactly this. Using the data akrun provided we could simply do:
library(forcats)
call$Name <- fct_other(call$Name, keep = nm1)

Store output of sapply into a data frame?

How can I store the output of sapply() in a data frame where the name of each element goes in the first column and its value in the second column? For illustration I show only 2 elements here, but there are 110 columns in my data. "loan" is the data frame.
cols <- sapply(loan,function(x) sum(is.na(x)))
cols
       id member_id
        0         7
I want output as:
var        value
id         0
member_id  7
I know that sapply() returns a vector, but when I print it, each value is shown together with an "index", e.g. the column name when sapply() is applied to a data frame. How can I store this as a data frame with two columns, where the first column contains the index part and the second column contains the value?
I found an answer to my question. For those who actually did understand my problem, this answer might make sense:
cols <- data.frame(sapply(loan ,function(x) sum(is.na(x))))
cols <- cbind(variable = row.names(cols), cols)
I wanted the row.names to be in a column of the same data frame corresponding to the values obtained from sapply.
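The "index" printed with the values is the names attribute of the vector, so it can also be pulled out directly with names(). A minimal sketch, using a small made-up loan data frame in place of the real 110-column one:

```r
# Made-up stand-in for the loan data frame
loan <- data.frame(id = 1:4,
                   member_id = c(NA, 7, NA, 9),
                   grade = c("A", NA, "B", "C"))

# Named integer vector: one NA count per column
na_counts <- sapply(loan, function(x) sum(is.na(x)))

# Names go into one column, the bare values into the other
result <- data.frame(var = names(na_counts),
                     value = unname(na_counts),
                     stringsAsFactors = FALSE)
result
```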
We can use stack
stack(mylist)[2:1]
data
mylist <- list(df = 1, rf = 2)
Is this what you want?
Your original list:
L <- c("df",1,"rf",2)
L
[1] "df" "1" "rf" "2"
As a data frame:
N <- length(L)
df <- data.frame( var = L[seq(1,N,2)], value = L[seq(2,N,2)] )
df
  var value
1  df     1
2  rf     2
