I have a column in a dataset that looks like this:
Actual
chr
5.25%
-5.50*1000000000
0.24%
-4.00*1000
4.5%
My goal is to access it and automatically convert the cells that contain *1000 or *1000000000, performing the calculation: e.g. -5.50*1000000000 should become -5 500 000 000 in the cell and -4.00*1000 should become -4000.
Does anyone have a hint on how to do this?
Best regards
This can be done by first splitting the strings on * and then mapping over the pieces with purrr's map_dbl to perform the calculations:
library(purrr)
library(dplyr)
library(stringr)

df %>%
  # Step 1: split strings on `*`:
  mutate(x_new = strsplit(x, "\\*")) %>%
  # Step 2: where the original string contains `*`, convert both parts to
  # numeric and multiply; otherwise keep the original value:
  mutate(x_new = ifelse(str_detect(x, "\\*"),
                        map_dbl(x_new, function(x) as.numeric(x)[1] * as.numeric(x)[2]),
                        x))
x x_new
1 -5.50*1000000000 -5.5e+09
2 35% 35%
3 -4.00*1000 -4000
(the warning messages come from as.numeric() hitting the percentage entries and can be ignored)
Test data:
df <- data.frame(x = c("-5.50*1000000000", "35%", "-4.00*1000"))
If your string is guaranteed to be a valid expression that R can evaluate literally, you can use
eval(parse(text = '-5*1000'))
This parses the string into its R code equivalent and then executes it using eval. In this case, it yields the numeric result -5000.
Tread with care: eval(parse()) executes arbitrary code, so only use it on input you trust.
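As a sketch of applying this to the question's column (assuming the test data frame df with column x from the first answer; only the entries containing * are valid R expressions, so we parse those alone):
has_expr <- grepl("\\*", df$x)
df$x_new <- df$x
df$x_new[has_expr] <- sapply(df$x[has_expr], function(s) eval(parse(text = s)))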
I am looking for a way to create a new column (using dplyr's mutate) based on certain "conditions".
library(tidyverse)

qq <- 5
df <- data.frame(rn = 1:qq,
                 a = rnorm(qq, 0, 1),
                 b = rnorm(qq, 10, 5))
myf <- function(dataframe, value){
  result <- dataframe %>%
    filter(rn <= value) %>%
    nrow()
  return(result)
}
The above example is a simplified version of my problem: I am trying to filter the piped data frame (df) and obtain a new column (foo) whose values show how many rows have rn less than or equal to the current row's rn (each row's rn coming from the piped df). Below you can see the output I am getting versus the one I expect to obtain:
df %>%
  mutate(
    foo_i_am_getting = myf(., rn),
    foo_expected = 1:qq)
rn a b foo_i_am_getting foo_expected
1 1 -0.5403937 -4.945643 5 1
2 2 0.7169147 2.516924 5 2
3 3 -0.2610024 -7.003944 5 3
4 4 -0.9991419 -1.663043 5 4
5 5 1.4002610 15.501411 5 5
The actual calculation I am trying to perform is more cumbersome; however, if I solve the simplified version above, I believe I can handle the rest of the manipulation/calculations inside the custom function.
BONUS QUESTION: Currently the name of the column I want to apply the filter on (i.e. rn) is hardcoded in the custom function (filter(rn <= value)). It would be great if this were an argument of the custom function, passed 'tidyverse' style - i.e. without quotation marks - e.g. myf <- function(dataframe, rn, value).
Disclaimer: I've done my best to describe the problem at hand; however, if there are still unclear spots, please let me know so I can elaborate further.
Thanks in advance for your support!
You need to do it element by element, because right now you are passing the whole vector to filter instead of a single value each time:
df %>%
  mutate(
    foo_i_am_getting = map_dbl(.$rn, function(x) nrow(filter(., rn <= x))),
    foo_expected = 1:qq)
Now we pass 1 to filter on the rn column (and the function returns the number of rows), then 2, and so on.
Wrapped up as a function (with the filter column passed tidyverse-style via {{ }}, answering the bonus question):
myf <- function(vec_filter, dataframe, vec_rn) {
  map_dbl(vec_filter, ~ nrow(filter(dataframe, {{vec_rn}} <= .x)))
}

df %>%
  mutate(
    foo_i_am_getting = map_dbl(.$rn, function(x) nrow(filter(., rn <= x))),
    foo_expected = 1:qq,
    foo_function = myf(rn, ., rn))
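A usage sketch with a different column, just to show the tidy-eval argument at work (hypothetical: counts rows where b <= each row's b):
df %>%
  mutate(foo_b = myf(b, ., b))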
Preamble
I apologize in advance: I find it very difficult to express the exact problem in writing and think it is clearest when looking at the code. Also, I am relatively new to R and have trouble using the right words to accurately describe the situation. I think the solution should be easy for someone with a bit more experience to point out; any advice would be much appreciated!
Description
I have a specialized calculation that I am trying to make on a group-by-group basis for which I have written a function. The function is user-defined to execute this specialized calculation, requires 4 arguments (2 of which have a length >1) and outputs a single value (so the output does not equal the length of the input). While this function does work, I need to be able to efficiently apply it to each group in a data frame (for the reproducible example below there are 4 groups but, in reality, there will be 100s or 1000s of groups).
I've tried to use the apply functions, which are often recommended for situations that sound similar to this, but I have so far been unsuccessful in using a non-for loop approach. I think this is because each row in the data frame is not associated with a different group, rather there are multiple rows associated with a single group (for the reproducible example below there are 21 rows associated with each group, which is the same as the actual data).
Regardless, it seemed like a for loop would be a straightforward way to apply my function to the rows associated with each group. However, I am unable to generate the desired output. As I alluded to in the preamble, I think it is just because I am overlooking/unaware of something very fundamental, such as the need to do a loop inside a loop or index my for loop differently.
Reproducible Example
Functionally similar data
interval <- 0.05  # used here to generate v1 and again in the function
v1 <- seq(0.00000000001, 1.00000000001, by = interval)
nrows <- length(v1)  # determines the length of the other variables
g1 <- c(rep(23.4, nrows), rep(19.7, nrows), rep(25.2, nrows), rep(16.4, nrows))
v2 <- runif(length(g1), 0, 1)
dat <- data.frame(g1, v1, v2)
Where:
g1 is the grouping variable
v1 is the first argument, repeats for each grouping var
v2 is the second argument, represents a probability associated with each v1
dat is the data frame
The Function
(This is my first function and I assume there is a better way to write it but it does work)
MyFunction <- function(v1, v2, interval, nrows) {
  # Note: `2:nrows-1` parses as `(2:nrows) - 1`, i.e. indices 1..(nrows-1);
  # this matches the hand calculation below only because v1[1] is ~0.
  sum.prod <- sum(v1[2:nrows-1] * v2[2:nrows-1])
  last.val <- v2[nrows]/2
  out <- 2 * (sum.prod + last.val) * interval
  out
}
Proof that the function works
I am providing the calculation for the first grouping variable (g1 = 23.4) just in case it is helpful to confirm that the function works, and to show how it works, since there is no documentation for this function:
range1 = 1:nrows
g1.sub1 = dat$g1[range1]
v1.sub1 = dat$v1[range1]
v2.sub1 = dat$v2[range1]
g.first = 2 * ((v1.sub1[2] * v2.sub1[2])+
(v1.sub1[3] * v2.sub1[3]) + (v1.sub1[4] * v2.sub1[4]) +
(v1.sub1[5] * v2.sub1[5]) + (v1.sub1[6] * v2.sub1[6]) +
(v1.sub1[7] * v2.sub1[7]) + (v1.sub1[8] * v2.sub1[8]) +
(v1.sub1[9] * v2.sub1[9]) + (v1.sub1[10] * v2.sub1[10]) +
(v1.sub1[11] * v2.sub1[11]) + (v1.sub1[12] * v2.sub1[12]) +
(v1.sub1[13] * v2.sub1[13]) + (v1.sub1[14] * v2.sub1[14]) +
(v1.sub1[15] * v2.sub1[15]) + (v1.sub1[16] * v2.sub1[16]) +
(v1.sub1[17] * v2.sub1[17]) + (v1.sub1[18] * v2.sub1[18]) +
(v1.sub1[19] * v2.sub1[19]) + (v1.sub1[20] * v2.sub1[20]) +
v2.sub1[21] / 2) * interval
g.first
Which matches the value given by:
MyFunction(v1 = v1.sub1, v2 = v2.sub1, interval = interval, nrows=nrows)
Where I am Stuck: The For Loop
As I alluded to in the description, I've tried various approaches to solve this problem, including the apply family of functions, without luck. The following code represents the closest that I have come. However, this only gives me the correct value for the first element in g1 (23.4) four times, rather than the correct value for each of the four elements in g1 (23.4, 19.7, 25.2, 16.4) one time.
g <- unique(g1)
out <- NULL
for (i in seq_along(g)) {
  # v1 and v2 here are the full generating vectors, unchanged each iteration;
  # nothing indexes the i-th group, so only the first nrows elements are used.
  out[i] <- MyFunction(v1 = v1, v2 = v2, interval = interval, nrows = nrows)
}
out
Attempt to Troubleshoot the For Loop
I can force the above for loop to produce something similar to the desired results, but the range must be specified for each group; since the actual data has 100s of groups rather than just 4, and the total number of groups is not known in advance, this is not a workable solution.
g <- unique(g1)
range1 <- 1:nrows
range2 <- (nrows+1):(nrows*2)
range3 <- (nrows*2+1):(nrows*3)
range4 <- (nrows*3+1):(nrows*4)
out1 <- NULL
out2 <- NULL
out3 <- NULL
out4 <- NULL
for (i in seq_along(g)) {
  out1[i] <- MyFunction(v1 = dat$v1[range1], v2 = dat$v2[range1],
                        interval = interval, nrows = nrows)
  out2[i] <- MyFunction(v1 = dat$v1[range2], v2 = dat$v2[range2],
                        interval = interval, nrows = nrows)
  out3[i] <- MyFunction(v1 = dat$v1[range3], v2 = dat$v2[range3],
                        interval = interval, nrows = nrows)
  out4[i] <- MyFunction(v1 = dat$v1[range4], v2 = dat$v2[range4],
                        interval = interval, nrows = nrows)
}
out1
out2
out3
out4
The Desired Output
Ideally, the final output would be a table/matrix/list/data frame that contains each value of g1 and the associated value output by the function ("out").
Something like:
g1    out
23.4  some value between 0 and 1
19.7  some value between 0 and 1
25.2  some value between 0 and 1
16.4  some value between 0 and 1
Concluding Thoughts
Since my "Attempt to Troubleshoot the For Loop" was ultimately able to produce the correct outputs, albeit in an undesirable way (labor-intensive, not scalable, and outputting 4 identical values per group rather than 1 value per group), I think my code is lacking something fundamental (e.g., another loop, a different variable for seq_along, improper indexing). I hope this is easy for a more experienced user to identify and explain, as I am stumped.
Thanks in advance!
I realize you asked for a for loop, but as you've probably seen before, there's usually a better way to do it. I guess you're not familiar with the data.table package yet; think of it as a supercharged data.frame.
So what you want to do is to apply MyFunction to your data, grouped by the column g1. This can be easily achieved in data.table in the following way.
library(data.table)
DT <- as.data.table(dat)
DT[, .(out = MyFunction(v1, v2, interval, .N)), by = g1]
So what these lines do is: first load the library (you might have to install it first with install.packages('data.table')), then convert your data.frame to a data.table, and finally compute the column out as MyFunction applied to v1, v2, interval and .N (think of .N as nrows), grouped by g1.
I think this achieves your goal, if you have any questions feel free to ask. Hope this helps.
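For reference, the same grouped computation can also be written in base R with split() (a sketch using the objects defined in the question; groups come out sorted by g1):
out <- sapply(split(dat, dat$g1), function(d) MyFunction(d$v1, d$v2, interval, nrow(d)))
data.frame(g1 = as.numeric(names(out)), out = out)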
Here's an approach using the tidyverse.
First, let's look at the example, replacing MyFunction with a few lines that capture the summarization process you describe:
library(tidyverse)
dat %>%
  slice(1:21) %>%   # Just the first grouping variable
  slice(2:n()) %>%  # Exclude the first row; small impact since v1[1] is nearly zero already
  mutate(prod = if_else(row_number() < n(),  # For all rows but the last one in the group,
                        v1 * v2,             # ... take the product of v1 and v2
                        v2/2)) %>%           # ... or half of v2, for the last row
  summarize(out = 2 * sum(prod) * interval)  # Sum the prod column, times 2 * interval
# out
#1 0.5980449
To do this for all groups of g1, we'd add group_by first, and then do the same summarization steps separately for each group:
dat %>%
  group_by(g1) %>%
  slice(1:21) %>%   # Now a no-op: each group already has exactly 21 rows
  slice(2:n()) %>%  # Exclude the first row of each group
  mutate(prod = if_else(row_number() < n(),  # For all rows but the last one in each group,
                        v1 * v2,             # ... take the product of v1 and v2
                        v2/2)) %>%           # ... or half of v2, for the last row
  summarize(out = 2 * sum(prod) * interval)  # Sum the prod column, times 2 * interval
## A tibble: 4 x 2
# g1 out
# <dbl> <dbl>
#1 16.4 0.342
#2 19.7 0.514
#3 23.4 0.598
#4 25.2 0.568
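Alternatively, since MyFunction already collapses each group's vectors to a single value, you can call it directly inside summarize (a sketch; n() plays the role of nrows):
dat %>%
  group_by(g1) %>%
  summarize(out = MyFunction(v1, v2, interval, n()))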
I have managed to do a chi-squared test using a loop in R, but it is very slow for large data, and I wonder if you could help me do it faster with something like dplyr. I've tried with dplyr, but I kept getting an error and I am not sure of the reason.
Here is a short example of my data:
df
1 2 3 4 5
row_1 2260.810 2136.360 3213.750 3574.750 2383.520
row_2 328.050 496.608 184.862 383.408 151.450
row_3 974.544 812.508 1422.010 1307.510 1442.970
row_4 2526.900 826.197 1486.000 2846.630 1486.000
row_5 2300.130 2499.390 1698.760 1690.640 2338.640
row_6 280.980 752.516 277.292 146.398 317.990
row_7 874.159 794.792 1033.330 2383.420 748.868
row_8 437.560 379.278 263.665 674.671 557.739
row_9 1357.350 1641.520 1397.130 1443.840 1092.010
row_10 1749.280 1752.250 3377.870 1534.470 2026.970
cs
1 1 1 2 1 2 2 1 2 3
What I want to do is run a chi-squared test between each row of df and cs, giving me the statistics and p-values as well as the row names.
Here is my code for the loop:
value = matrix(nrow = ncol(df), ncol = 3)
for (i in 1:ncol(df)) {
  tst <- chisq.test(df[i, ], cs)
  value[i, 1] <- tst$p.value
  value[i, 2] <- tst$statistic
  value[i, 3] <- rownames(df)[i]
}
Thanks for your help.
I guess you do want to do this column by column. Knowing the structure of Biobase::exprs(PANCAN_w) would have helped greatly. Even better would have been to use an example from the Biobase package instead of a dataset that cannot be found.
This is an implementation of the code I might have used. Note: you do NOT want to use a matrix to store results if you are expecting a mixture of numeric and character values; you would be coercing all the numerics to character:
value <- data.frame(p_val = NA, stat = NA, exprs = colnames(df))
for (i in seq_len(ncol(df))) {
  # tbl <- table(df[i, ], cs)   ### No use seen for this
  # Indexing changed in the next line to compare columns to the standard `cs`.
  tst <- chisq.test(df[, i], cs)  # chisq.test is not vectorized, so some sort of loop is needed
  value[i, 1:2] <- tst[c('p.value', 'statistic')]  # one assignment per row
}
Obviously, you would need to change every instance of df (not a great name, since there is also a df function) to Biobase::exprs(PANCAN_w).
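If you want to avoid growing the result row by row, a non-loop sketch along the same lines (still one chisq.test call per column; assumes the same df and cs as above):
res <- do.call(rbind, lapply(seq_len(ncol(df)), function(i) {
  tst <- chisq.test(df[, i], cs)
  data.frame(p_val = tst$p.value, stat = unname(tst$statistic), exprs = colnames(df)[i])
}))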
Suppose I have a string
age<-c("7y2m4d","5m4d","7y5m6d")
I want to convert it to a numeric vector like
c(7.34, 0.43, 7.43)
How can I write the R code for this?
We can assume there are 365 days in a year and 365/12 days in a month.
lubridate::duration will convert your strings to (approximate) seconds.
library(lubridate)
library(magrittr)
age <- c("7y2m4d", "5m4d", "7y5m6d")
age_sec <- age %>%
  duration() %>%
  as.numeric()
age_sec
[1] 226508400 13494600 234570600
Then you can approximate years as 365 * 24 * 60 * 60 seconds:
age_sec / (365 * 24 * 60 * 60)
[1] 7.182534 0.427911 7.438185
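Note that lubridate's duration months are about 30.44 days, slightly different from the question's 365/12 convention, so the numbers differ a little. A sketch of getting years directly with time_length (which counts a year as 365.25 days for durations):
time_length(duration(age), "year")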
Another solution with base R:
age<-c("7y2m4d","5m4d","7y5m6d")
age <- gsub('y', ' + ', age)
age <- gsub('m', ' / 12 + ', age)
age <- gsub('d', ' / 365', age)
sapply(age, function(x) eval(parse(text = x)))
#7 + 2 / 12 + 4 / 365 5 / 12 + 4 / 365 7 + 5 / 12 + 6 / 365
# 7.1776256 0.4276256 7.4331050
The idea is to create the formula and then evaluate it for each element of your vector.
These solutions:
handle missing y, m and/or d,
give the same answer as in the question (except for the first element of age, for which the question appears to have computed the answer incorrectly),
avoid the use of eval, and
only use base R (except for alternative (1a)).
Comparing the solutions below on the basis of simplicity, (1a) is the simplest and automatically handles all the edge cases without specific code for them, suggesting that it is the most natural; however, it does make use of a package. (1) is only slightly more complex and uses no packages, and (2) is pretty short and also uses no packages, but it is not as simple as (1) or (1a).
1) Here getNum extracts and returns the number from x associated with the code (the code is "y", "m" or "d") or if the code is not present in x returns 0. We then add up the year, month/12 and day/365.
getNum <- function(code, x) {
  pat <- sprintf(".*?(\\d+)%s.*", code)
  as.numeric(ifelse(grepl(code, x), sub(pat, "\\1", x), 0))
}
getNum("y", age) + getNum("m", age) / 12 + getNum("d", age) / 365
## [1] 7.1776256 0.4276256 7.4331050
1a) This is similar to (1) except that we use strapply in gsubfn to simplify getNum. In fact getNum reduces to a single strapply call and the regular expression it uses is also simpler.
library(gsubfn)
getNum <- function(code, x) {
  strapply(x, paste0("(\\d+)", code), as.numeric, empty = 0, simplify = TRUE)
}
getNum("y", age) + getNum("m", age) / 12 + getNum("d", age) / 365
## [1] 7.1776256 0.4276256 7.4331050
2) This alternative converts each string to dcf format and uses read.dcf to create a matrix of the y, m and d numbers.
In detail, the first line of code is to handle certain edge cases which are not actually present in the sample data in the question. We first append 0d to age (from the question) if d is missing so that we can handle the case where y, m and d are all missing. We also prepend a dummy entry to ensure that y, m and d are present in at least one entry. If we knew that y, m and d were present in at least one component and there was no component in which y, m and d were all simultaneously missing then this first line of code could be omitted.
The second line of code converts each input character string to dcf form and reads it into a matrix ensuring that the columns are in a known order and deleting the dummy entry added above.
Finally, we replace NAs with 0 and use matrix multiplication to add up the year, month/12 and day/365.
a0 <- c("0y0m0d", paste0(age, ifelse(grepl("d", age), "", "0d")))
m <- read.dcf(textConnection(gsub("(\\d+)(\\D)", "\\2: \\1\n", a0)))[-1, c("y", "m", "d")]
m[is.na(m)] <- 0
c(array(as.numeric(m), dim(m)) %*% c(1, 1/12, 1/365))
## [1] 7.1776256 0.4276256 7.4331050
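For intuition, this is what the dcf conversion produces for a single element:
cat(gsub("(\\d+)(\\D)", "\\2: \\1\n", "7y2m4d"))
## y: 7
## m: 2
## d: 4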
Update: Rearranged and added (1) and (1a).
I have a large file with the following format which I read as x
userid,productid,freq
293994,8,3
293994,5,3
949859,2,1
949859,1,1
123234,1,1
123234,3,1
123234,4,1
...
It gives the product a given user bought and its frequency. I'm trying to make it into a matrix which gives all the productid's as columns and userids as rows with the frequency value as the entry. So the expected output is
1 2 3 4 5 8
293994 0 0 0 0 3 3
949859 1 1 0 0 0 0
123234 1 0 1 1 0 0
It is a sparse matrix. I tried doing table(x[[1]],x[[2]]) which works for small files, but beyond a point table gives an error
Error in table(x[[1]], x[[2]]) :
attempt to make a table with >= 2^31 elements
Execution halted
Is there a way to get this to work? I'm on R 3.1.0, which is supposed to support 2^51-sized vectors, so I am confused why it can't handle this file size. I have 40MM lines with a total file size of 741M. Thanks in advance.
One data.table way of doing it is:
library(data.table)
library(reshape2)

# adjust fun.aggregate as necessary - not very clear what you want from the OP
dcast.data.table(your_data_table, userid ~ productid, value.var = "freq", fill = 0L)
You can check if that works for your data.
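For completeness, a sketch with the question's sample rows read straight into a data.table (fread accepts literal text; column names assumed as in the header):
library(data.table)
x <- fread("userid,productid,freq
293994,8,3
293994,5,3
949859,2,1
949859,1,1
123234,1,1
123234,3,1
123234,4,1")
dcast(x, userid ~ productid, value.var = "freq", fill = 0L)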
# This is old, but worth noting: the Matrix package's sparseMatrix() builds
# the object directly, without reshaping.
userid <- c(293994, 293994, 949859, 949859, 123234, 123234, 123234)
productid <- c(8, 5, 2, 1, 1, 3, 4)
freq <- c(3, 3, 1, 1, 1, 1, 1)

library(Matrix)

# The dgCMatrix sparse matrix is a fraction of the size and builds much faster
# than reshaping if the data gets large.
x <- sparseMatrix(i = as.integer(as.factor(userid)),
                  j = as.integer(as.factor(productid)),
                  dimnames = list(levels(as.factor(userid)),
                                  levels(as.factor(productid))),
                  x = freq)

# Easily converted to a dense matrix:
x <- as.matrix(x)

# Learned this the hard way using recommenderlab (a package built on top of
# Matrix) to build a binary matrix, so noting it in case it helps someone else.
Here is a tidyr approach to this:
library(tidyverse)
library(magrittr)

# Replicate your example data
example_data <- matrix(
  c(293994, 8, 3,
    293994, 5, 3,
    949859, 2, 1,
    949859, 1, 1,
    123234, 1, 1,
    123234, 3, 1,
    123234, 4, 1),
  ncol = 3,
  byrow = TRUE) %>%
  as.data.frame %>%
  set_colnames(c('userid', 'productid', 'freq'))

# Convert data into wide format
spread(example_data, key = productid, value = freq, fill = 0)
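As a side note, spread has since been superseded by tidyr's pivot_wider; a sketch of the equivalent call (tidyr >= 1.0; a scalar values_fill needs tidyr >= 1.1):
pivot_wider(example_data, names_from = productid, values_from = freq, values_fill = 0)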
spread will be a lot faster than the base R table operation, but at scale, data.table would in turn easily outperform tidyr / dplyr. However, as noted in the previous answer, the data.table equivalent dcast isn't working properly. This seems to be a known issue which, unfortunately, remains unresolved.
I tried the tidyr approach at scale (2 million records). I couldn't make it work on my local machine, so you'll have to either chop it up (and then use rbind) or take it to a cluster (with rhadoop or sparklyr).
Nonetheless, code for a reproducible "big data" example below in case somebody else would like to add something.
# Make some random IDs
randomkey <- function(digits){
  paste(sample(LETTERS, digits, replace = TRUE), collapse = '')
}

products <- replicate(10, randomkey(20)) %>% unique
customers <- replicate(500000, randomkey(50)) %>% unique

big_example_data <- data.frame(
  useruid = rep(sample(customers, length(customers), replace = FALSE), 4),
  productid = sample(products, replace = TRUE),
  freq = sample(1:5)
)

# 2 million rows of purchases
dim(big_example_data)

# With useruid, productid, freq
head(big_example_data)

# Test the tidyr approach
system.time(
  big_matrix <- spread(big_example_data, key = productid, value = freq, fill = 0)
)