Split Apply Combine

I have a large list, and would like to apply the exact technique detailed in the answer here:
Create mutually exclusive dummy variables from categorical variable in R
However, my data is much larger, and I would like to split, apply and combine the operation to each individual row.
This code, which of course does not work, illustrates what I am trying to do:
id <- c(1,1,1,1)
time <- c(1,2,3,4)
time <- as.character(time)
unique.time <- as.character(unique(df$time))
df <- data.frame(id,time)
df1 <- split(df, row(df))
sapply(df1, (unique.time, function(x)as.numeric(df1$time == x)))
z <- unsplit(lapply(df1, row(df)), scale), x)
Thanks!
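A minimal sketch of one way to get the dummy columns described above, assuming the goal is one 0/1 column per unique value of time. The names dummies and result and the time_ prefix are made up for illustration. Because the comparison df$time == x is vectorized over all rows, no row-wise split is needed, which should help on large data.

```r
# Example data from the question, built in a valid order
df <- data.frame(id = c(1, 1, 1, 1),
                 time = as.character(1:4),
                 stringsAsFactors = FALSE)
unique.time <- unique(df$time)

# One 0/1 column per unique time value; each comparison covers all rows at once
dummies <- sapply(unique.time, function(x) as.numeric(df$time == x))
colnames(dummies) <- paste0("time_", unique.time)  # hypothetical column prefix

# Combine the dummy columns with the original data
result <- cbind(df, dummies)
```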

Related

R: Use apply correctly

I would like to write some simple code that joins the columns and then counts how many times each category appears in my data frame. My problem is that when I use apply, I get the right result, but multiplied by five.
EXAMPLE:
a <- c('car','bike',NA,'moto','skate')
b <- c(NA,'car',NA,NA,'bike')
c <- c('car',NA,NA,'skate',NA)
d <- c('moto','skate',NA,'car',NA)
data <- data.frame(a,b,c,d)
then, using apply:
x <- vector('list',length = NCOL(data)*NROW(data))
one_column <- apply(data,1,function(y){
x <- rbind(y,x)
return(x)
})
Then I unlist the result and use table to count how often each category appears in my data:
one_column <- unlist(one_column)
table(one_column)
But to get the right result I need to divide by 5:
table(one_column)/5

The x list you created is 5 times longer than each row that apply passes to your function, so rbind recycles the row values 5 times. What you want is this instead:
x <- vector('list',length = NCOL(data))
Or, like emilliman5 says, just use table(unlist(data)).
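As a sketch of that last suggestion, using the data from the question (stringsAsFactors = FALSE is added here as an assumption, so the columns are character vectors in every R version): unlist flattens all columns into one vector, and table drops NA values by default. The name counts is just for illustration.

```r
a <- c('car', 'bike', NA, 'moto', 'skate')
b <- c(NA, 'car', NA, NA, 'bike')
c <- c('car', NA, NA, 'skate', NA)
d <- c('moto', 'skate', NA, 'car', NA)
data <- data.frame(a, b, c, d, stringsAsFactors = FALSE)

# Flatten every column into a single vector, then count each category once
counts <- table(unlist(data))
print(counts)
```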

How to assign the output of a sapply loop to the original columns in a data frame without losing other columns

I have a data frame with several columns containing string answers from different assessors, who mixed upper and lower case at random in their answers. I want to convert everything to lower case. I have code that works as follows:
# Creating a reproducible data frame similar to what I am working with
dfrm <- data.frame(a = sample(names(islands))[1:20],
b = sample(unname(islands))[1:20],
c = sample(names(islands))[1:20],
d = sample(unname(islands))[1:20],
e = sample(names(islands))[1:20],
f = sample(unname(islands))[1:20],
g = sample(names(islands))[1:20],
h = sample(unname(islands))[1:20])
# This is how I did it originally by writing everything explicitly:
dfrm1 <- dfrm
dfrm1$a <- tolower(dfrm1$a)
dfrm1$c <- tolower(dfrm1$c)
dfrm1$e <- tolower(dfrm1$e)
dfrm1$g <- tolower(dfrm1$g)
head(dfrm1) #Works as intended
The problem is that as the number of assessors increases, I keep making copy-paste errors. I tried to simplify my code by writing a function around tolower and using sapply to loop over it, but the final data frame does not look like what I wanted:
# function and sapply:
dfrm2 <- dfrm
my_list <- c("a", "c", "e", "g")
my_low <- function(x){dfrm2[,x] <- tolower(dfrm2[,x])}
sapply(my_list, my_low) #Didn't work
# Alternative approach:
dfrm2 <- as.data.frame(sapply(my_list, my_low))
head(dfrm2) #Lost the numbers
What am I missing?
I know this must be a very basic concept that I'm not getting. There was this question and answer that I simply couldn't follow, and this one where my non-working solution simply seems to work. Any help appreciated, thanks!

Maybe you want to create a logical vector that selects the columns to change and run an apply function only over those columns.
# only choose non-numeric columns
changeCols <- !sapply(dfrm, is.numeric)
# change values of selected columns to lower case
dfrm[changeCols] <- lapply(dfrm[changeCols], tolower)
If you have other types of columns, say logical, you could also be more explicit about the types of columns that you want to change. For example, to select only factor and character columns, use:
changeCols <- sapply(dfrm, function(x) is.factor(x) | is.character(x))
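If you would rather keep the explicit list of column names from the question, the same pattern works with a character vector of names. This is a minimal sketch under the assumption that the listed columns are character (stringsAsFactors = FALSE is set here to make that explicit); a smaller four-column version of the example data is used to keep it short.

```r
set.seed(42)  # only so the sampled example data is reproducible
dfrm <- data.frame(a = sample(names(islands))[1:20],
                   b = sample(unname(islands))[1:20],
                   c = sample(names(islands))[1:20],
                   d = sample(unname(islands))[1:20],
                   stringsAsFactors = FALSE)

my_list <- c("a", "c")  # columns to convert
dfrm2 <- dfrm

# Assigning into dfrm2[my_list] replaces only those columns; the rest stay untouched
dfrm2[my_list] <- lapply(dfrm2[my_list], tolower)
head(dfrm2)
```

Because the assignment happens outside any function, there is no scoping problem and no need for a global assignment.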

For your first attempt, if you want the assignments to your data frame dfrm2 to stick, use the <<- assignment operator:
my_low <- function(x){ dfrm2[,x] <<- tolower(dfrm2[,x]) }
sapply(my_list, my_low)

Overlay lines with a varying number of points from a list using ggplot2

I'd like to plot multiple lines with a varying number of points per line, with different colors using ggplot2. My MWE is given by
test <- list()
length(test) <- 10
for(i in 1:10){
test[[i]] <- rnorm(100 - i) # Note - different number of points per line!!!!
}
Note that the vectors in the list have different lengths, so it is not possible to convert the list directly into a data.frame.

So this gets you what you want, I think. Note that it works on your list, which has a different number of points per vector; that is of course one main reason why one would use a list instead of a data frame.
Most if not all of the examples on SO for this scenario work with data frames rather than data in lists. Since the vectors have different lengths, answers that address this by melting a data frame to long form do not apply.
If you did happen to have a data frame, which implies a set of vectors of the same length, then you could use melt, although gather from tidyr would probably be a more modern idiom than melt from reshape2. Note that melt can also be used on lists, although I would have to research how it handles the id.
I also chose not to use a function from the lapply family because I wanted to emphasize the "wide data" to "long data" aspect, something I think a for loop does far better than lapply, which beginning users can find mysterious.
Anyway, we should probably be using something from purrr now, as that is a modern type-stable functional library.
Here is some code, using a for loop, so not the most compact, but unrolled to make it easy and quick to understand:
library(ggplot2)
test <- list()
length(test) <- 10
for(i in 1:10){
test[[i]] <- rnorm(100 - i)
}
# Convert data to long form
df <- NULL
for(i in 1:10){
ydat <- test[[i]]
ndf <- data.frame(key=paste0("id",i),x=1:length(ydat),y=ydat)
df <- rbind(df,ndf)
}
# plot it
ggplot(df) + geom_line(aes(x=x,y=y,color=key))

As already pointed out by Mike Wise in his accepted answer, ggplot2 requires a data.frame as input, preferably in long format.
However, both the question and the accepted answer use for loops, although R has neat functions for this. To create the test data set, the following "one-liner" can be used:
set.seed(1234L) # required to ensure reproducible data
test <- lapply(100L - 1:10, rnorm)
instead of
test <- list()
length(test) <- 10
for(i in 1:10){
test[[i]] <- rnorm(100 - i)
}
Note the use of set.seed() to ensure reproducible random data.
To reshape test from wide to long form, the whole list is turned into a data.frame at once using unlist(), adding the additional columns as required:
df <- data.frame(
id = rep(seq_along(test), lengths(test)),
x = sequence(lengths(test)),
y = unlist(test)
)
instead of turning each list element into a separate small data.frame and incrementally appending the pieces to a target data.frame using a for loop.
The plot is then created by
library(ggplot2)
ggplot(df) + geom_line(aes(x = x, y = y, color = as.factor(id)))
Alternatively, the melt() function has a method for lists:
library(data.table)
long <- melt(test, measure.vars = seq_along(test))
setDT(long)[, rn := rowid(L1)] # add row numbers for each group
ggplot(long) + aes(x = rn, y = value, color = as.factor(L1)) + geom_line()

As there were some remarks about the for loops, here is an alternate and more sophisticated approach in a modern idiom (i.e. purrr from the tidyverse).
Creates an id vector as a factor (ids) so as to avoid warnings about combining levels later.
Sets up a function (mkdf) to make a data frame from an id variable and a vector of data.
Uses map2 from purrr to merge ids and the original data list with mkdf
Uses bind_rows from dplyr to merge the resulting list of data frames into one.
Plots it.
The code:
library(purrr)   # map2
library(dplyr)   # bind_rows and the %>% pipe
library(ggplot2) # plotting
# dummy up some wide data (but of different lengths) in a **list** of curves
test <- list()
for(i in 1:5){
test[[i]] <- rnorm(10 - i)
}
# helper data (could do inline, but it would be harder to read)
ids <- as.factor(sprintf("id-%d",1:length(test))) # curve ids as factors
mkdf <- function(x,y) data.frame(xx=1:length(x),yy=x,key=y) # makes into dataframe
df <- test %>% map2(ids,mkdf) %>% bind_rows() #single pipe using purrr and dplyr
# plot it
ggplot(df) + geom_line(aes(x=xx,y=yy,color=key))
I reduced the data sizes to make the plot easier to read.

Subsetting efficiently on multiple columns and rows

I am trying to subset my data to drop rows with certain values of certain variables. Suppose I have a data frame df with many columns and rows, I want to drop rows based on the values of variables G1 and G9, and I only want to keep rows where those variables take on values of 1, 2, or 3. In this way, I aim to subset on the same values across multiple variables.
I am trying to do this with few lines of code and in a manner that allows quick changes to the variables or values I would like to use. For example, assuming I start with data frame df and want to end with newdf, which excludes observations where G1 and G9 do not take on values of 1, 2, or 3:
# Naive approach (requires manually changing variables and values in each line of code)
newdf <- df[which(df$G1 %in% c(1,2,3)), ]
newdf <- newdf[which(newdf$G9 %in% c(1,2,3)), ]
# Better approach (requires manually changing variable names in each line of code)
vals <- c(1,2,3)
newdf <- df[which(df$G1 %in% vals), ]
newdf <- newdf[which(newdf$G9 %in% vals), ]
If I wanted to not only subset on G1 and G9 but MANY variables, this manual approach would be time-consuming to modify. I want to simplify this even further by consolidating all of the code into a single line. I know the below is wrong but I am not sure how to implement an alternative.
newdf <- c(1,2,3)
newdf <- c(df$G1, df$G9)
newdf <- df[which(df$vars %in% vals, ]
It is my understanding I want to use apply() but I am not sure how.

You do not need to use which with %in%; it already returns a logical vector that can index rows directly. How about the below:
keepies <- (df$G1 %in% vals) & (df$G9 %in% vals)
newdf <- df[keepies, ]
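To extend this to many variables without writing one condition per column, the per-column tests can be combined with Reduce. This is a minimal sketch on made-up data; the data frame contents and the names vars and keep are assumptions for illustration.

```r
# Made-up example data; G5 is included to show that unlisted columns are ignored
df <- data.frame(G1 = c(1, 2, 4, 3, 5),
                 G5 = c(9, 9, 9, 9, 9),
                 G9 = c(3, 3, 1, 7, 2))

vars <- c("G1", "G9")  # variables to filter on
vals <- c(1, 2, 3)     # values to keep

# One logical vector per variable, then AND them together row by row
keep <- Reduce(`&`, lapply(df[vars], `%in%`, vals))
newdf <- df[keep, , drop = FALSE]
```

Changing the filter then only means editing vars or vals.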

Use data.table
First, melt your data
library(data.table)
DT <- melt(as.data.table(df)) # convert to a data.table first so melt can reshape it
Then split into lists
DTLists <- split(DT, by = "variable") # one list element per original column
Now you can operate on each list element using lapply
DTresult <- lapply(DTLists, function(x) {
...
})

Creating multiple subsets all in one data.frame (possibly with ddply)

I have a large data.frame, and I'd like to be able to reduce it by using a quantile subset by one of the variables. For example:
x <- c(1:10,1:10,1:10,1:10,1:10,1:10,1:10,1:10,1:10,1:10)
df <- data.frame(x,rnorm(100))
df2 <- subset(df, df$x == 1)
df3 <- subset(df2, df2[2] > quantile(df2$rnorm.100.,0.8))
What I would like to end up with is a data.frame that contains all quantiles for x=1,2,3...10.
Is there a way to do this with ddply?

You could try:
library(plyr)
ddply(df, .(x), subset, rnorm.100. > quantile(rnorm.100., 0.8))
And off topic: you could use df <- data.frame(x,y=rnorm(100)) to name a column on-the-fly.

Here's a different approach using the little-used ave() function, which is very fast to compute.
Make a new column that contains the quantile calculation across each level of x
df$quantByX <- ave(df$rnorm.100., df$x, FUN = function (x) quantile(x,0.8))
Select the items of the new column and the x column.
df2 <- unique(df[,c(1,3)])
The result is one data frame with the unique items in the x column and the calculated quantile for each level of x.
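If the goal is instead to keep the rows above the 80% quantile within each level of x, as in the question's df3, the same ave() idea can be used as a row filter. This is a minimal sketch in base R; the column name y and the names cutoff and top are assumptions for illustration.

```r
set.seed(1)  # only so the example is reproducible
df <- data.frame(x = rep(1:10, times = 10), y = rnorm(100))

# Per-group 80% quantile, repeated so it lines up with every row of df
cutoff <- ave(df$y, df$x, FUN = function(v) quantile(v, 0.8))

# Keep only the rows that exceed their own group's cutoff
top <- df[df$y > cutoff, ]
```

With ten values per group and the default quantile type, this keeps the two largest values for each level of x.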
