Data.frame row calculation - r

I want to compute a value for each row of my data.frame with a simple function, e.g. sqrt(column1 * column2). I have 17 rows, and I want to apply the function to each of them to create a new column called d.
How can I do this? With combine? By converting the data.frame into a matrix with t(x)? Or with some other function? I want the result to still be a data.frame (a table).

You have several options to choose from:
Base R
df$d <- sqrt(df$column1^2 + df$column2^2)
# or (transform() returns a new data frame, so assign the result)
df <- transform(df, d = sqrt(column1^2 + column2^2))
Tidyverse
library(tidyverse)
df <- df %>%
  mutate(d = sqrt(column1^2 + column2^2))
head(df)
All of these methods keep your data as a data frame.
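As a quick check that the result really stays a data frame, here is a minimal sketch on a made-up three-row table (the values are invented for illustration; only the column names column1, column2 and d come from the question):

```r
# Hypothetical toy data; the real table has 17 rows.
df <- data.frame(column1 = c(3, 5, 8), column2 = c(4, 12, 15))

# with() lets the expression refer to the columns by name.
df$d <- with(df, sqrt(column1^2 + column2^2))

is.data.frame(df)  # the object is still a data frame
df
```

The calculation is vectorised, so no loop over the rows is needed.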

Related

Unnest a ts class

My data contains sales data for multiple customers, each with different start and end dates, so I applied simple exponential smoothing per customer.
I used the following code to apply ses:
library(zoo)
library(forecast)
z <- read.zoo(data_set,FUN = function(x) as.Date(x) + seq_along(x) / 10^10 , index = "Date", split = "customer_id")
L <- lapply(as.list(z), function(x) ts(na.omit(x),frequency = 52))
HW <- lapply(L, ses)
Now my output is a list whose elements have uneven lengths. How can I unnest or unlist the output into a data frame and get the fitted values, actuals and residuals along with their dates, sales and customer_id?
Note: the reason I am posting my input code rather than the HW data is that the HW data is too large.
Can someone help me do this in R?
I would use the tidyverse to handle this problem.
library(tidyverse)
map(HW, ~ .x %>%
      as.data.frame %>%        # convert each element of the list to a data.frame
      rownames_to_column) %>%  # add the row names as a column within each element
  bind_rows(.id = "customer_id")  # bind all elements and add the customer ID
I am not sure how to relate the dates and actual sales to your output (HW). If you explain that, I might be able to provide a solution to that part of the problem too.
First, I took all the unique customer_id values into a variable called 'k':
k <- unique(data_set$customer_id)
Then I created an empty data frame:
b <- data.frame()
I extracted the fitted values for each customer in a for loop, stored them in 'a', and appended them to data frame 'b' with rbind:
for (key in k) {
  # fitted values of one customer as a one-column data frame
  a <- as.data.frame(as.numeric(HW_ses[[key]]$model$fitted))
  b <- rbind(b, a)
}
Finally, I attached data frame 'b' to the input data set with cbind:
data_set_final <- cbind(data_set, b)
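The general pattern here, stacking a list of uneven-length vectors into one long data frame with an id column, can be sketched in base R without a growing loop. The list below is a stand-in with invented numbers and hypothetical customer names, not real ses output; I would expect the same idea to apply to components of each fit, but that part is an assumption on my side.

```r
# Stand-in for per-customer fitted values (hypothetical names and numbers).
fits <- list(cust_A = c(10.2, 11.0, 10.7), cust_B = c(5.1, 5.4))

# Build one small data frame per customer, then stack them.
long <- do.call(rbind, Map(function(id, v) {
  data.frame(customer_id = id,
             step = seq_along(v),  # position within that customer's series
             fitted = v,
             stringsAsFactors = FALSE)
}, names(fits), fits))
rownames(long) <- NULL
long
```

Because each piece carries its own customer_id, the uneven lengths are not a problem, and the result can later be merged back onto the input data by customer and position.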

R - Averaging large matrix

I currently have a large matrix, with 72 rows and 919 columns.
amatrix <- matrix(rexp(919, rate=.1), ncol=919, nrow=72)
As this matrix contains technical replicates, I must first average the values of the technical replicates prior to further analysis. The technical replicates are consecutive rows, in groups of 3.
Is there a way to average 3 rows at a time together, to result in a new matrix with 24 rows and 919 columns?
I have been doing this part manually so far and importing the data back into R. There must be a way to do this in R, but I can't find a similar answer.
I believe the key thing is to know how to describe the pattern using R code, e.g.
rep(1:(nrow(amatrix)/3), each=3)
Then it's simply a matter of group-level aggregation. You can do this with any base, dplyr, data.table, or other aggregation method.
Let's start with base R.
I prefer to work with this as a data.frame, but you could also keep it as a matrix and just use [] indexing instead of $ to create a new vector:
amatrix <- as.data.frame(matrix(rexp(919, rate=.1), ncol=919, nrow=72))
amatrix$technical_rep_number <- rep(1:(nrow(amatrix)/3), each=3)
Creation of this vector is actually entirely optional. You could also leave your matrix as-is and just specify the pattern (rep(1:(nrow(amatrix)/3), each=3), in this case) within the aggregation function.
From base R we can use aggregate:
new_table <- aggregate(amatrix, by=list(amatrix$technical_rep_number), mean)
nrow(new_table)
# [1] 24
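If you would rather keep the data as a matrix, the same grouping pattern can be passed to base R's rowsum(). This is only a sketch under the assumption that every group has exactly 3 rows (so dividing the group sums by 3 gives the means); the names grp and averaged are mine.

```r
set.seed(1)
amatrix <- matrix(rexp(72 * 919, rate = 0.1), nrow = 72, ncol = 919)

# One group label per row: 1,1,1,2,2,2,...,24,24,24
grp <- rep(seq_len(nrow(amatrix) / 3), each = 3)

# rowsum() adds up the rows within each group; dividing by the
# group size (3) turns the sums into means.
averaged <- rowsum(amatrix, grp) / 3
dim(averaged)
```

The result is still a matrix, with one row per group of technical replicates.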
In dplyr we can use group_by and summarize:
library(dplyr)
new_table <- amatrix %>%
  group_by(technical_rep_number) %>%
  summarize(mean1 = mean(V1)) # etc
You can also take the means of all of the columns at once like this:
new_table <- amatrix %>%
  group_by(technical_rep_number) %>%
  summarise_each(funs(mean))
Note, however, that summarise_each() has been deprecated, so I recommend summarize_all(). funs() is deprecated as well, so pass the function directly:
new_table <- amatrix %>%
  group_by(technical_rep_number) %>%
  summarize_all(mean)
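In current dplyr (1.0.0 or later, which is an assumption about your installation) the scoped verbs have been superseded by across(). A minimal sketch of the same aggregation written that way:

```r
library(dplyr)

set.seed(1)
amatrix <- as.data.frame(matrix(rexp(72 * 919, rate = 0.1), nrow = 72))
amatrix$technical_rep_number <- rep(seq_len(nrow(amatrix) / 3), each = 3)

# across(everything(), mean) applies mean() to every non-grouping column.
new_table <- amatrix %>%
  group_by(technical_rep_number) %>%
  summarise(across(everything(), mean))

dim(new_table)
```

The grouping column is kept as the first column of the result, so the table has one more column than the original matrix.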

using adist on two columns of data frame

I want to use adist to calculate edit distance between the values of two columns in each row.
I am using it in more-or-less this way:
A <- c("mad","car")
B <- c("mug","cat")
my_df <- data.frame(A,B)
my_df$dist <- adist(my_df$A, my_df$B, ignore.case = TRUE)
my_df <- my_df[order(dist),]
The last two lines are the same as in my real code, but the actual data frame looks a bit different: the columns of my original data frame are character type, not factor. Also, the dist column seems to be returned as a 2-column matrix, and I have no idea why.
Update:
I have read a bit and found that I need to apply it over the rows, so my new code is as follows:
apply(my_df, 1, function(d) adist(d[1], d[2]))
It works fine, but for my original dataset referring to columns by number is impractical. How can I refer to column names in this function?
Using a tidyverse approach, you can use the following code:
library(tidyverse)
A <- c("mad","car")
B <- c("mug","cat")
my_df <- data.frame(A,B)
my_df %>%
  rowwise() %>%
  mutate(Lev_dist = adist(x = A, y = B, ignore.case = TRUE))
You can overcome that problem by using mapply, i.e.
mapply(adist, my_df$A, my_df$B)
#[1] 2 1
As per the adist documentation, the x and y arguments should be character vectors. In your example the function returns a 2x2 matrix because it also compares the cross pairs, "mad" with "cat" and "car" with "mug".
Just look at the main diagonal of the matrix.
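To make the diagonal idea concrete, here is a small sketch using the question's own example. The variable name full is mine; the columns are referred to by name, which also avoids the column-number problem.

```r
A <- c("mad", "car")
B <- c("mug", "cat")
my_df <- data.frame(A, B, stringsAsFactors = FALSE)

# adist() compares every element of A with every element of B.
full <- adist(my_df$A, my_df$B, ignore.case = TRUE)

# The row-wise distances (A[i] against B[i]) sit on the main diagonal.
my_df$dist <- diag(full)

# Sort by the new column; note my_df$dist, not the bare name dist.
my_df <- my_df[order(my_df$dist), ]
my_df
```

For a large data frame the full matrix grows with the square of the number of rows, so the mapply version may be the better choice there.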

R: Apply function on specific columns preserving the rest of the dataframe

I'd like to learn how to apply functions to specific columns of my dataframe without "excluding" the other columns from my df. For example, I'd like to multiply some specific columns by 1000 and leave the other ones as they are.
Using the sapply function for example like this:
a<-as.data.frame(sapply(table.xy[,1], function(x){x*1000}))
I get new dataframes with the first column multiplied by 1000 but without the other columns that I didn't use in the operation. So my attempt was to do it like this:
a<-as.data.frame(sapply(table.xy, function(x) if (colnames=="columnA") {x/1000} else {x}))
but this one didn't work.
My workaround was to give both dataframes an extra column of IDs and later merge the old dataframe with the newly created one to get a complete one. But I think there must be a better solution. Is there one?
If you only want to do a computation on one or a few columns, you can use transform or simply index the column manually:
# with transform:
df <- data.frame(A = 1:10, B = 1:10)
df <- transform(df, A = A*1000)
# Manually:
df <- data.frame(A = 1:10, B = 1:10)
df$A <- df$A * 1000
The following code will apply the desired function to only the columns you specify.
I'll create a simple data frame as a reproducible example.
(df <- data.frame(x = 1, y = 1:10, z=11:20))
(df <- cbind(df[1], apply(df[2:3],2, function(x){x*1000})))
Basically, use cbind() to select the columns you don't want the function to run on, then use apply() with desired functions on the target columns.
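A variation on the same idea, sketched here as one possible alternative rather than the only way: assign the transformed columns back in place by name. The vector cols is a name I introduced for the example.

```r
df <- data.frame(x = 1, y = 1:10, z = 11:20)

# Names of the columns to transform; everything else is left untouched.
cols <- c("y", "z")

# lapply() returns a list of modified columns, which is assigned
# back into the same positions of the data frame.
df[cols] <- lapply(df[cols], function(v) v * 1000)
df
```

This keeps the original column order even when the target columns are not adjacent, which the cbind() approach does not guarantee.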
In dplyr we would use mutate_at, in which you can select or exclude (by preceding the variable name with a "-" minus sign) specific variables.
You can just name a function:
df <- df %>%
  mutate_at(vars(columnA), scale)
or create your own:
df <- df %>%
  mutate_at(vars(columnA, columnC), function(x) x * 1000)
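In dplyr 1.0.0 or later (an assumption about your installed version), mutate_at has been superseded by across(). A minimal sketch on a toy data frame, with invented column names x, y and z:

```r
library(dplyr)

df <- data.frame(x = 1, y = 1:10, z = 11:20)

# Multiply only y and z by 1000; x is carried along unchanged.
df <- df %>%
  mutate(across(c(y, z), ~ .x * 1000))
df
```

To exclude columns instead, a negative selection such as across(-x, ...) should work in the same place.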

Creating multiple subsets all in one data.frame (possibly with ddply)

I have a large data.frame, and I'd like to be able to reduce it by using a quantile subset by one of the variables. For example:
x <- c(1:10,1:10,1:10,1:10,1:10,1:10,1:10,1:10,1:10,1:10)
df <- data.frame(x,rnorm(100))
df2 <- subset(df, df$x == 1)
df3 <- subset(df2, df2[2] > quantile(df2$rnorm.100.,0.8))
What I would like to end up with is a data.frame that contains all quantiles for x=1,2,3...10.
Is there a way to do this with ddply?
You could try:
ddply(df, .(x), subset, rnorm.100. > quantile(rnorm.100., 0.8))
And off topic: you could use df <- data.frame(x,y=rnorm(100)) to name a column on-the-fly.
Here's a different approach with the little-used ave() function (it is very fast to calculate this way).
Make a new column that contains the quantile calculated within each level of x:
df$quantByX <- ave(df$rnorm.100., df$x, FUN = function (x) quantile(x,0.8))
Select the items of the new column and the x column.
df2 <- unique(df[,c(1,3)])
The result is one data frame with the unique items in the x column and the calculated quantile for each level of x.
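If the goal is the rows above the 80% quantile within each level of x (what the question's subset() does for x == 1), ave() can also be used directly as a per-row cutoff. This is a sketch with a named column y and a fixed seed, both of which are my additions:

```r
set.seed(42)
df <- data.frame(x = rep(1:10, times = 10), y = rnorm(100))

# Per-row cutoff: the 80% quantile of y within that row's level of x.
cutoff <- ave(df$y, df$x, FUN = function(v) quantile(v, 0.8))

# Keep only the rows above their own group's cutoff.
top <- df[df$y > cutoff, ]
table(top$x)
```

With 10 values per group and no ties, each level of x keeps its two largest values.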
