How to lappy() over selective columns? - R - r

I am a novice R programmer. I am wondering how to lappy over a dataframe but avoiding certain columns.
# Some dummy dataframe
df <- data.frame(
grp = c("A", "B", "C", "D"),
trial = as.factor(c(1,1,2,2)),
mean = as.factor(c(44,33,22,11)),
sd = as.factor(c(3,4,1,.5)))
df <- lapply(df, function (x) {as.numeric(as.character(x))})
However, the method I used introduces NAs by coercion.
Would there to selectively (or deselectively) lapply over the dataframe while maintaining the integrity of the dataframe?
In other words, would there be a way to convert only mean and sd to numerics? (In general form)
Thank you

Try doing this:
df[,3:4] <- lapply(df[,3:4], function (x) {as.numeric(as.character(x))})
You are simply passing function to the specified columns. You can also provide a condition to select subset of your columns, something like excluding the ones you don't want to cast.
col = names(df)[names(df)!=c("grp","trial")]
df[,col] <- lapply(df[,col], function (x) {as.numeric(as.character(x))})

Well as you might have guessed, there are many ways. Since you seem to be doing in place substitution, actually, a for loop would be suitable.
df <- data.frame(
grp = c("A", "B", "C", "D"),
trial = as.factor(c(1,1,2,2)),
mean = as.factor(c(44,33,22,11)),
sd = as.factor(c(3,4,1,.5)))
my_cols <- c("trial", "mean", "sd")
for(mc in my_cols) {
df[[mc]] <- as.numeric(as.character(df[[mc]]))
}

If you want to convert selectively by column names:
library(dplyr)
df %>%
mutate_if(names(.) %in% c("mean", "sd"),
function(x) as.numeric(as.character(x)))

Related

Retrieve number of factor levels from columns within a function in R

I am trying to create a function that performs several statistical tests on specific columns in a dataframe. Some of the tests require more than one level. I would like to test how many levels are in a specific column, but can't seem to get it right.
In my actual code this section would be followed by an ifelse that returns a string saying 'only one level' if single, or continues to the statistical test if > 1.
require("dplyr")
df <- data.frame(A = c("a", "b", "c"), B = c("a", "a", "a"), C = c("a", "b", "b")) %>%
mutate(A = factor(A)) %>%
mutate(B = factor(B)) %>%
mutate(C = factor(C))
my_funct <- function(data_f, column){
n_fact <- paste("data_f", column, sep = "$")
n_levels <- do.call("nlevels",
list(x = as.name(n_fact)))
print(n_levels)
}
```
Then I call my function with the dataframe and column
my_funct(df, "A")
I get the following error:
Error in levels(x) : object 'data_f$A' not found
If I remove the as.name() wrapper it returns a value of 0.
One reason your code is not working is because data_f$A is not the name of any object available to the function.
But I would recommend you don't even try to parse code as strings. It's the wrong way to do it. All you need is double bracket indexing [[. So the body of your function can be the following single line:
nlevels(data_f[[column]])
And for all the columns:
sapply(data_f, nlevels)

One liner wanted: Create data frame and give colnames: R data.frame(..., colnames = c("a", "b", "c"))

Is there an easier (i.e. one line of code instead of two!) way to do the following:
results <- as.data.frame(str_split_fixed(c("SampleID_someusefulinfo.countsA" , "SampleID_someusefulinfo.countsB" , "SampleID_someusefulinfo.counts"), "\\.", n=2))
names(results) <- c("a", "b")
Something like:
results <- data.frame(str_split_fixed(c("SampleID_someusefulinfo.countsA" , "SampleID_someusefulinfo.countsB" , "SampleID_someusefulinfo.counts"), "\\.", n=2), colnames = c("a", "b"))
I do this a lot, and would really love to have a way to have this in one line of code.
/data.table works too, if it's easier to do there than in base data.frame/
Clarifying:
My expected output (which is achieved by running the two lines of code at the top - AND I WANT IT TO BE ONE - THAT's IT!!!) is a result data frame of the structure:
results
a b
1 SampleID_someusefulinfo countsA
2 SampleID_someusefulinfo countsB
3 SampleID_someusefulinfo counts
What I would like to do is:
CREATE the data frame from a matrix or with some content (for example the toy code of matrix(c(1,2,3,4),nrow=2,ncol=2) I provided in the first example I wrote)
SPECIFY IN THAT SAME LINE what I would like the column names of my data frame to be
Use setNames() around a data.frame
setNames(data.frame(matrix(c(1,2,3,4),nrow=2,ncol=2)), c("a","b"))
# a b
#1 1 3
#2 2 4
?setNames:
a convenience function that sets the names on an object and returns the object
> setNames
function (object = nm, nm)
{
names(object) <- nm
object
}
We can use the dimnames option in matrix as the OP was using matrix to create the data.
data.frame(matrix(1:4, 2, 2, dimnames=list(NULL, c("a", "b"))))
Or
`colnames<-`(data.frame(matrix(1:4, 2, 2)), c('a', 'b'))

How can I make list of R dataframe columns conditional on value of another column

Lets say that I have a dataframe:
df <- data.frame(VAR1 = c(1,2,3,4,5,6), VAR2 = c("A","A","A","B","B","B"))
and I want to make a list of VAR1 values grouped by each VAR2 level:
myList <- list(c(1,2,3), c(4,5,6))
I can use:
myList <- list(df[df$VAR2 == "A", ]$VAR1, df[df$VAR2 == "B", ]$VAR1)
Ideally though I'd like to use more straightforward solution w/o hardcoding because I have larger data with many levels in the factor variable.
We can use split
split(df$VAR1, df$VAR2)

Systematic replace part of variable name with 1st element of an associated R vector

I have a dataframe in which the 1st element of an associated 'name' vector is related to subsequent named numerical vectors. I am attempting to replace the meaningless number with the 1st element of the associated name vector.
Here is an example dataframe:
df <- data.frame(data.0.name = c("A", "A", "A"), data.0.one_minute_ago = c(1,2,1), data.0.one_hour_ago = c(2,2,3),
data.1.name = c("B", "B", "B"), data.1.one_minute_ago = c(3,3,2), data.1.one_hour_ago = c(5,6,2))`
Each number.name vector is associated with a construct (either A or B in this case) and each number.time is associated with a time dimension. So, data.0.one_minute_ago is actually the number of A's you had one_minute_ago.
What I would like to do (because I have a large dataset with lots of the transformations) is to replace the number.dimension with the construct.dimension, and of course do that for each number. from 0:9
I've written some grep code to begin with this task, but to no avail (I am stuck with retaining everything after the number.
grep( "data.[0-9].name" ,names(df), perl=TRUE)
as.character(df[1, 1])
as.character(df[1, 4])
as.character(names(df[2]))
as.character(names(df[3]))
as.character(names(df[5]))
as.character(names(df[6]))
df.1 <- (df[1, grep( "data.[0-9].name" ,names(df))])
df.1 <- (df[1, grep( "data.[0-9].name" ,names(df))])
df.1 <- data.frame(lapply(df.1, as.character), stringsAsFactors=FALSE)
constructs <- as.character(df.1[1,c(1:2)])
Here the 1st and 2nd element of constructs are the constructs associated with 0.name/0.dimension and 1.name/1.dimension respectively.
constructs [1]
constructs [2]
From there, I'm fairly certain the code would involve some names(df)[] <- but am uncertain on where to go from here.
Any and all help appreciated.
EDIT: here is the desired variable name output: simply changing the variable names (and of course retain the values associated with the variable names:
data.A.name data.A.one_minute_ago data.A.one_hour_ago data.B.name data.B.one_minute_ago data.B.one_hour_ago
EDIT 2: In my true dataset, the number of repetitions per dimensions (i.e., one_minute_ago, one_hour_ago, one_day_ago) can vary across construct (i.e, two dimensions for one construct and 3 for another, and 9 for another). I would like the solution to take that into account.
Here is a modified sample dataset to reflect this subtlety:
df <- data.frame(data.0.name = c("A", "A", "A"), data.0.one_minute_ago = c(1,2,1), data.0.one_hour_ago = c(2,2,3),
data.1.name = c("B", "B", "B"), data.1.one_minute_ago = c(3,3,2), data.1.one_hour_ago = c(5,6,2),
data.2.name = c("C", "C", "C"), data.2.one_minute_ago = c(3,3,2), data.2.one_hour_ago = c(5,6,2), data.2.one_day_ago = c(3,2,3))
We create a grouping 'indx' based on the 'number' in the column names. split the column names based on the 'indx' ('lst'). Get one element from the columns having 'name' as suffix ('r1'). Use 'Map' and gsub to replace the 'number' in each element of 'lst' with that of 'r1'.
indx <- gsub('[^0-9]+', '', names(df))
lst <- split(names(df), indx)
r1 <- as.character(unlist(df[1,grep('name', names(df))]))
lst2 <- Map(function(x,y) gsub('[0-9]+', y, x), lst, r1)
names(df) <- unsplit(lst2, indx)
names(df)
# [1] "data.A.name" "data.A.one_minute_ago" "data.A.one_hour_ago"
#[4] "data.B.name" "data.B.one_minute_ago" "data.B.one_hour_ago"
#[7] "data.C.name" "data.C.one_minute_ago" "data.C.one_hour_ago"
#[10] "data.C.one_day_ago"
I think this works:
library(stringr)
splits <- str_split(names(df), "\\.")
trailing_name <- sapply(splits, "[[", 3)
constructs <- rep(constructs, each = 3)
constructs
# [1] "A" "A" "A" "B" "B" "B"
names(df) <- str_c("data", constructs, trailing_name, sep=".")
names(df)
# [1] "data.A.name" "data.A.one_minute_ago" "data.A.one_hour_ago" "data.B.name"
# [5] "data.B.one_minute_ago" "data.B.one_hour_ago"

Merge rows with condition and limit in a dataframe

I have the following dummy dataset of 1000 observations:
obs <- 1000
df <- data.frame(
a=c(1,0,0,0,0,1,0,0,0,0),
b=c(0,1,0,0,0,0,1,0,0,0),
c=c(0,0,1,0,0,0,0,1,0,0),
d=c(0,0,0,1,0,0,0,0,1,0),
e=c(0,0,0,0,1,0,0,0,0,1),
f=c(10,2,4,5,2,2,1,2,1,4),
g=sample(c("yes", "no"), obs, replace = TRUE),
h=sample(letters[1:15], obs, replace = TRUE),
i=sample(c("VF","FD", "VD"), obs, replace = TRUE),
j=sample(1:10, obs, replace = TRUE)
)
One key feature of this dataset is that the variables a to e's values are only one 1 and the rest are 0. We are sure the only one of these five columns have a 1 as value.
I found a way to extract these rows given a condition (with a 1) and assign to their respective variables:
df.a <- df[df[,"a"] == 1,,drop=FALSE]
df.b <- df[df[,"b"] == 1,,drop=FALSE]
df.c <- df[df[,"c"] == 1,,drop=FALSE]
df.d <- df[df[,"d"] == 1,,drop=FALSE]
df.e <- df[df[,"e"] == 1,,drop=FALSE]
My dilemma now is to limit the rows saved into df.a to df.e and to merge them afterwards.
Here's a shorter way to create df.merged:
# variables of 'df'
vars <- c("a", "b", "c", "d", "e")
# number of rows to extract
n <- 100
df.merged <- do.call(rbind, lapply(vars, function(x) {
head(df[as.logical(df[[x]]), ], n)
}))
Here, rbind is sufficient. The function rbind.fill is necessary if your data frames differ with respect to the number of columns.
To get the n-rows subset, a simple data[1:n,] does the job.
df.a.sub <- df.a[1:10,]
df.b.sub <- df.b[1:10,]
df.c.sub <- df.c[1:10,]
df.d.sub <- df.d[1:10,]
df.e.sub <- df.e[1:10,]
Finally, merge them by (it took the most time to find a straightforward "merge multiple dataframes" and all I needed to do was rbind.fill(df1, df2, ..., dfn) thanks to this question and answer):
require(plyr)
df.merged <- rbind.fill(df.a.sub, df.b.sub, df.c.sub, df.d.sub, df.e.sub)

Resources