Use R to compute a ratio when columns hold JSON lists?

I have a data frame with 2 columns where each row in each column holds a JSON array, like:
df$col1[1] <- "[14,7,5,3,4,0,1,7,2,3,1,18,13,4,23,7,8,8,11,18,15,6,2,10,2,4,8,5,11,5,1,5,2,4,3,1,6,8,5,5,3,1,1,4,5,2,9,3,4,11,11,14,3,12,2,6,0,0,15,1,18,5,3,6,6,6]"
and a scalar column:
df$scalar <- 10, .... , 10
I want to apply the following formula:
((fromJSON(df$col1) / scalar1) / (fromJSON(df$col2) / scalar2))
I have done something like this:
lapply(df$col1, function(i) {fromJSON(i)/scalar1}) /
lapply(df$col2, function(i) {fromJSON(i)/scalar2})
Is there any other way to do this?

We can loop through the columns that are in JSON format and apply fromJSON, then divide each by its scalar using Map, and Reduce the result to a single vector:
Reduce(`/`, Map(`/`, lapply(df[c('col1', 'col2')], fromJSON), df[c('scalar1', 'scalar2')]))
A similar approach using map from purrr would be
library(purrr)
map2(df[c('col1', 'col2')], df[c('scalar1', 'scalar2')], ~ fromJSON(.x)/.y) %>%
reduce(`/`)
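For a data frame with several rows, one option is to parse and divide row by row. The following is a minimal sketch, assuming the jsonlite package and columns named col1, col2, scalar1 and scalar2 as in the question; the sample values and the helper name row_ratio are made up for illustration.

```r
library(jsonlite)

# Made-up sample data; column names follow the question.
df <- data.frame(
  col1 = c("[14,7,5]", "[2,4,6]"),
  col2 = c("[7,7,10]", "[1,2,3]"),
  scalar1 = c(10, 10),
  scalar2 = c(5, 5),
  stringsAsFactors = FALSE
)

# Hypothetical helper: parse both JSON arrays of one row, scale each,
# then take the element-wise ratio.
row_ratio <- function(json1, json2, s1, s2) {
  (fromJSON(json1) / s1) / (fromJSON(json2) / s2)
}

# Map() walks the four columns in parallel and returns one vector per row.
ratios <- unname(Map(row_ratio, df$col1, df$col2, df$scalar1, df$scalar2))
ratios[[1]]
```

The result is a list with one numeric vector per row, so rows whose JSON arrays differ in length from other rows are not a problem.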

Related

apply function to all meaningful combination in a list and save in matrix format

I want to apply a function (distance between distance matrices) to a list of distance matrices and extract the calculated value in a matrix/table format.
I am using the ecodist package for the calculation and nested lapply calls to make all possible combinations of the MRM distance calculation.
Part 1:
library("ecodist")
#example data
data(graze)
#make list to get it looking like my data
grazelist<-as.list.data.frame(graze)
#all vs all distance combination
grazedist<-lapply(names (grazelist),function(z)
lapply(names(grazelist),function(f)
MRM(dist(grazelist [[z]])~ dist(grazelist[[f]]),nperm=1)))
This makes all vs. all possible combinations, but I only need each combination once (half the matrix), as the calculation takes very long and the result for the second half is the same. Any idea how to solve this?
Part 2:
I would like to gather only the dist value under $coef per calculation in a matrix format for follow-up processing. In this case 1.000000e+00.
R output looks like this:
> head(grazedist[[1]])
[[1]]
[[1]]$`coef`
                     dist(grazelist[[z]]) pval
Int                          8.881784e-16    1
dist(grazelist[[f]])         1.000000e+00    1
[[1]]$r.squared
  R2 pval
   1    1
[[1]]$F.test
           F       F.pval
3.753766e+18 1.000000e+00
I know how to get it as txt or csv file for a simple R output not processed from a list (and called MRM_calculation).
write.table(MRM_calculation$coef[2,1],file="file.txt")
But how can I collect all dist values of $coef in a data frame, table or directly matrix format from a list?
something like:
mapply(write.csv2, x=grazedist$coef,
file=paste(names(grazedist),"value.csv"))
I'm using larger matrices (1500x1500) in a list, but I hope the example data graze is sufficient as a reproducible example.
You could use combn, which for your example data generates only 351 combinations rather than the 729 combinations (27 x 27) produced by the nested lapply. You can then apply the function to every combination using the FUN argument of combn, extract the dist value in coef, and write it to a data frame.
library(ecodist)
df <- data.frame(value = combn(names(grazelist), 2, function(x)
MRM(dist(grazelist[[x[1]]])~ dist(grazelist[[x[2]]]),nperm=1)$coef[[2]]))
and then write this to csv
write.csv(df, "/path/to/file/filename.csv", row.names = FALSE)
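If a matrix is needed rather than a one-column data frame, the values from combn can be placed into the lower triangle of a square matrix and mirrored. This is a minimal sketch under stated assumptions: it uses four columns of the built-in mtcars data and cor() purely as a fast stand-in for the MRM() call, and the names lst, vals and m are illustrative.

```r
# Stand-in data: four numeric columns of the built-in mtcars data set.
lst <- as.list(mtcars[1:4])

# One value per unordered pair; cor() is only a placeholder for the MRM() call.
vals <- combn(names(lst), 2, function(p) cor(lst[[p[1]]], lst[[p[2]]]))

# Empty square matrix labelled by element name.
m <- matrix(NA_real_, length(lst), length(lst),
            dimnames = list(names(lst), names(lst)))

# combn() emits pairs in the same order in which lower.tri() walks the matrix
# (column by column), so the values can be assigned directly.
m[lower.tri(m)] <- vals

# Mirror the lower triangle into the upper one; the diagonal stays NA.
m[upper.tri(m)] <- t(m)[upper.tri(m)]
m
```

Each pair is computed only once, and the diagonal is left as NA because combn never pairs an element with itself.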

Looping a rep() function in r

df is a frequency table, where the values in a were reported as many times as recorded in column x,y,z. I'm trying to convert the frequency table to the original data, so I use the rep() function.
How do I loop the rep() function to give me the original data for x, y, z without having to repeat the function several times like I did below?
Also, can I input the result into a data frame, bearing in mind that the output will have different column lengths:
a <- (1:10)
x <- (6:15)
y <- (11:20)
z <- (16:25)
df <- data.frame(a,x,y,z)
df
rep(df[,1], df[,2])
rep(df[,1], df[,3])
rep(df[,1], df[,4])
If you don't want to repeat the rep() call for each column, you can use an apply-family function. Note that you cannot store the result in a data.frame directly because the objects are of different lengths, but you can store it in a list and access the elements in a similar way to a data.frame. Something like this works:
df2<-sapply(df[,2:4],function(x) rep(df[,1],x))
What this sapply function is saying is for each column in df[,2:4], apply the rep(df[,1],x) function to it where x is one of your columns ( df[,2], df[,3], or df[,4]).
The below code just makes sure the apply function is giving the same result as your original way.
identical(df2$x,rep(df[,1], df[,2]))
[1] TRUE
identical(df2$y,rep(df[,1], df[,3]))
[1] TRUE
identical(df2$z,rep(df[,1], df[,4]))
[1] TRUE
EDIT:
If you want it as a data.frame object you can do this:
res<-as.data.frame(sapply(df2, '[', seq(max(sapply(df2, length)))))
Note this introduces NAs into your data.frame so be careful!
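An alternative that avoids the NA padding is to keep the result in long format. The sketch below assumes the df from the question and uses base R's lapply and stack; the names expanded and long are illustrative.

```r
a <- 1:10
x <- 6:15
y <- 11:20
z <- 16:25
df <- data.frame(a, x, y, z)

# One expanded vector per frequency column, kept as a named list.
expanded <- lapply(df[-1], function(freq) rep(df$a, times = freq))

# Each element is as long as the sum of its frequency column.
lengths(expanded)

# stack() turns the named list into a two-column data frame:
# 'values' holds the data, 'ind' records which column it came from.
long <- stack(expanded)
head(long)
```

Because every value sits in its own row, the different lengths of x, y and z no longer matter.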

Efficient Combination and Operating on Large Data Frames

I have 2 relatively large data frames in R. I'm attempting to merge / find all combos, as efficiently as possible. The resulting df turns out to be huge (the length is dim(myDF1)[1]*dim(myDF2)[1]), so I'm attempting to implement a solution using ff. I'm also open to using other solutions, such as the bigmemory package, to work around these memory issues. I have virtually no experience with either of these packages.
Working example - assume I'm working with some data frame that looks similar to USArrests:
library('ff')
library('ffbase')
myNames <- USArrests
myNames$States <- rownames(myNames)
rownames(myNames) <- NULL
Now, I will fabricate 2 data frames, which represent some particular sets of observations from myNames. I'm going to try to reference them by their rownames later.
myDF1 <- as.ffdf(as.data.frame(matrix(as.integer(rownames(myNames))[floor(runif(3*1e5, 1, 50))], ncol = 3)))
myDF2 <- as.ffdf(as.data.frame(matrix(as.integer(rownames(myNames))[floor(runif(2*1e5, 1, 50))], ncol = 2)))
# unique combos:
myDF1 <- unique(myDF1)
myDF2 <- unique(myDF2)
For example, my first set of states in myDF1 are myNames[unlist(myDF1[1, ]), ]. Then I will find all combos of myDF1 and myDF2 using ikey :
# create keys:
myDF1$key <- ikey(myDF1)
myDF2$key <- ikey(myDF2)
startTime <- Sys.time()
# Create some huge vectors:
myVector1 <- ffrep.int(myDF1$key, dim(myDF2)[1])
myVector2 <- ffrep.int(myDF2$key, dim(myDF1)[1])
# This takes about 25 seconds on my machine:
print(Sys.time() - startTime)
# Sort one DF (to later combine with the other):
myVector2 <- ffsorted(myVector2)
# Sorting takes an additional 2.5 minutes:
print(Sys.time() - startTime)
1) Is there a faster way to sort this?
# finally, find all combinations:
myDF <- as.ffdf(myVector1, myVector2)
# Very fast:
print(Sys.time() - startTime)
2) Is there an alternative to this type of combination (without using RAM)?
Finally, I'd like to be able to reference any of the original data by row / column. Specifically, I'd like to get different types of rowSums. For example:
# Here are the row numbers (from myNames) for the top 6 sets of States:
this <- cbind(myDF1[myDF[1:6,1], -4], myDF2[myDF[1:6,2], -3])
this
# Then, the original data for the first set of States is:
myNames[unlist(this[1,]),]
# Suppose I want to get the sum of the Urban Population for every row, such as the first:
sum(myNames[unlist(this[1,]),]$UrbanPop)
3) Ultimately, I'd like a vector with the above rowSum, so I can perform some type of subset on myDF. Any advice on how to most efficiently accomplish this?
Thanks!
It's unclear to me what you intend to do with the rowSum and your point 3), but if you want an efficient and RAM-friendly way to get all combinations of 2 ff vectors, you can use expand.ffgrid from ffbase.
The following will generate your ffdf with dimensions 160 million rows x 2 columns in a few seconds.
require(ffbase)
x <- expand.ffgrid(myDF1$key, myDF2$key)
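For point 3), one possible approach is to compute the UrbanPop sum of each set once, before expanding, so that the combined table only needs an addition per row. The following is a small in-memory sketch in base R, using expand.grid in place of expand.ffgrid; the names sets1, sets2, sum1, sum2 and combos are illustrative, and an ff-backed version would need ff-aware indexing that is not shown here.

```r
pop <- USArrests$UrbanPop

set.seed(1)
# Made-up stand-ins for myDF1 / myDF2: each row is a set of state row numbers.
sets1 <- matrix(sample(50, 12, replace = TRUE), ncol = 3)  # 4 sets of 3 states
sets2 <- matrix(sample(50, 6, replace = TRUE), ncol = 2)   # 3 sets of 2 states

# Per-set UrbanPop totals, computed once per data frame.
sum1 <- rowSums(matrix(pop[sets1], nrow = nrow(sets1)))
sum2 <- rowSums(matrix(pop[sets2], nrow = nrow(sets2)))

# All key combinations (in-memory analogue of the expanded grid).
combos <- expand.grid(key1 = seq_len(nrow(sets1)),
                      key2 = seq_len(nrow(sets2)))

# The total for a combination is the sum of the two precomputed totals.
combos$UrbanPopSum <- sum1[combos$key1] + sum2[combos$key2]
head(combos)
```

This replaces a row-by-row sum over the full combination table with two small rowSums calls and one vectorised lookup.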

R - Summation of data frame columns changes data type

I have a data frame of 15 columns where the first column is an integer and others are numeric. I have to generate a one-liner summary of the sum of all columns except the last one. I need to generate mean of the last column. So, I am doing something as below:
summary <- c(sum(df$col1), ... mean(df$col15))
The summary then appears with values up to two decimal places even for the integer column (first one). I have been trying the round function to fix this. I can understand, when different types are added, e.g. 1 + 1.0. But, in this case, shouldn't the summation maintain the data-type?
Please let me know what I am missing.
If you are looking for a one-line summary:
lst <- c(lapply(df[-ncol(df)], function(x) sum(x)), mean=mean(df[,ncol(df)]))
as.data.frame(lst)
#  int num1 mean
#1  10    6  2.5
The output is a data frame that preserves the classes of each vector. If you would like the output to be added to the original data frame you can replace as.data.frame(lst) with:
names(lst) <- names(df)
rbind(df, lst)
If you are trying to get the sum of all integer columns and the mean of numeric columns, go with @Frank's answer.
Data
df <- data.frame(int=1:4, num1=seq(1,2,length.out=4), num2=seq(2,3,length.out=4))
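The likely reason for the decimals in the question is not sum() itself but c(): an atomic vector holds a single type, so combining an integer with a double converts everything to double. A short sketch using the example data above (the names summary_vec and summary_list are illustrative):

```r
df <- data.frame(int = 1:4,
                 num1 = seq(1, 2, length.out = 4),
                 num2 = seq(2, 3, length.out = 4))

# sum() of an integer column stays integer.
class(sum(df$int))

# c() coerces its arguments to one common type, here double.
summary_vec <- c(sum(df$int), sum(df$num1), mean(df$num2))
class(summary_vec)

# A list keeps each element's own type.
summary_list <- list(int = sum(df$int), num1 = sum(df$num1), mean = mean(df$num2))
sapply(summary_list, class)
```

This is why the list-based answer keeps the integer column as an integer while the c() version does not.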
Perhaps an adaptation of this?
apply(iris[,1:4], 2, sum) / c(rep(1,3), nrow(iris))

How to apply operation and sum over columns in R?

I want to apply some operations to the values in a number of columns, and then sum the results of each row across columns. I can do this using:
x <- data.frame(sample=1:3, a=4:6, b=7:9)
x$a2 <- x$a^2
x$b2 <- x$b^2
x$result <- x$a2 + x$b2
but this will become arduous with many columns, and I'm wondering if anyone can suggest a simpler way. Note that the dataframe contains other columns that I do not want to include in the calculation (in this example, column sample is not to be included).
Many thanks!
I would simply subset the columns of interest and apply everything directly on the matrix using the rowSums function.
x <- data.frame(sample=1:3, a=4:6, b=7:9)
# put column indices and apply your function
x$result <- rowSums(x[,c(2,3)]^2)
This of course assumes your function is vectorized. If not, you would need to use some apply variation (which you are seeing many of). That said, you can still use rowSums if you find it useful like so. Note, I use sapply which also returns a matrix.
# random custom function
myfun <- function(x){
return(x^2 + 3)
}
rowSums(sapply(x[,c(2,3)], myfun))
I would suggest converting the data set into the 'long' format, grouping it by sample, and then calculating the result. Here is the solution using data.table:
library(data.table)
melt(setDT(x),id.vars = 'sample')[,sum(value^2),by=sample]
#   sample  V1
#1:      1  65
#2:      2  89
#3:      3 117
You can easily replace value^2 by any function you want.
You can use the apply function, selecting the columns you need with c(i1, i2, ...):
apply(( x[ , c(2, 3) ])^2, 1 ,sum )
If you want to apply a function named somefunction to some of the columns, whose indices or colnames are in the vector col_indices, and then sum the results, you can do :
# if somefunction can be vectorized :
x$results<-apply(x[,col_indices],1,function(x) sum(somefunction(x)))
# if not :
x$results<-apply(x[,col_indices],1,function(x) sum(sapply(x,somefunction)))
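To make the difference between the two variants concrete, here is a sketch with a made-up function that is not vectorized. The definitions of somefunction and col_indices below are assumptions for illustration only.

```r
x <- data.frame(sample = 1:3, a = 4:6, b = 7:9)
col_indices <- c("a", "b")

# Made-up scalar-only function: if() needs a single TRUE/FALSE,
# so it cannot be called on a whole row at once.
somefunction <- function(v) {
  if (v > 5) v^2 else 0
}

# sapply() feeds the row to somefunction one value at a time.
x$results <- apply(x[, col_indices], 1,
                   function(row) sum(sapply(row, somefunction)))
x
```

With a function built only from vectorized operations, such as function(v) v^2, the inner sapply is unnecessary and the first variant is enough.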
I want to come at this one from a "no extensions" R POV.
It's important to remember what kind of data structure you are working with. Data frames are actually lists of vectors: each column is itself a vector. So you can use the handy-dandy lapply function to apply a function to the desired columns in the list/data frame.
I'm going to define a function as the square as you have above, but of course this can be any function of any complexity, so long as it takes a vector as an input and returns a vector of the same length. If it doesn't, it won't fit into the original data.frame!
The steps below are extra pedantic to show each little bit, but obviously it can be compressed into one or two steps. Note that I only retain the sum of the squares of each column, given that you might want to save space in memory if you are working with lots and lots of data.
1. Create data; define the function.
2. Grab the columns you want as a separate (temporary) data.frame.
3. Apply the function to the data.frame/list you just created.
4. lapply returns a list, so if you intend to retain it separately, make it a temporary data.frame. This is not necessary.
5. Calculate the sums of the rows of the temporary data.frame and append them as a new column in x.
6. Remove the temporary data.frame.
Code:
x <- data.frame(sample=1:3, a=4:6, b=7:9); square <- function(x) x^2 #step 1
x[2:3] #Step 2
temp <- data.frame(lapply(x[2:3], square)) #step 3 and step 4
x$squareRowSums <- rowSums(temp) #step 5
rm(temp) #step 6
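As noted, the steps can be compressed. One possible one-step form, using the same x and square as above, skips the named temporary object:

```r
x <- data.frame(sample = 1:3, a = 4:6, b = 7:9)
square <- function(v) v^2

# Apply square to columns 2 and 3, rebuild a data.frame, and sum across rows.
x$squareRowSums <- rowSums(data.frame(lapply(x[2:3], square)))
x
```

The intermediate data.frame still exists briefly, but it is never bound to a name, so there is nothing to remove afterwards.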
Here is another apply solution:
cols <- c("a", "b")
x <- data.frame(sample=1:3, a=4:6, b=7:9)
x$result <- apply(x[, cols], 1, function(x) sum(x^2))
