I want to generate 300 random values based on the following criteria:
Class  Value
0      1-8
1      9-11
2      12-14
3      15-16
4      17-20
Logic: when Class = 0, I want a random value between 1-8; when Class = 1, I want a random value between 9-11; and so on.
This gives me the following hypothetical table as an example:
Class Value
0 7
0 4
1 10
1 9
1 11
. .
. .
I also want to be able to have either an equal or an unequal number of observations in each class.
You could do:
# draw 300 class labels, then sample one value per row from the
# range corresponding to that row's class (the list is indexed by Class + 1)
df <- data.frame(Class = sample(0:4, 300, TRUE))
df$Value <- sapply(list(1:8, 9:11, 12:14, 15:16, 17:20)[df$Class + 1],
                   sample, size = 1)
This gives you a data frame with 300 rows and appropriate numbers for each class:
head(df)
#> Class Value
#> 1 0 3
#> 2 1 10
#> 3 4 19
#> 4 2 12
#> 5 4 19
#> 6 1 10
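Note that sample(0:4, 300, TRUE) only gives approximately equal class counts. If you want exact control over the mixture, equal or unequal, you can draw the class labels with rep() instead; the counts below are just illustrative:

# exactly 60 per class; change the counts for an unequal mixture,
# e.g. times = c(100, 80, 60, 40, 20)
df <- data.frame(Class = rep(0:4, times = c(60, 60, 60, 60, 60)))
df$Value <- sapply(list(1:8, 9:11, 12:14, 15:16, 17:20)[df$Class + 1],
                   sample, size = 1)
table(df$Class)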
Here is a version with some additional flexibility: different probabilities can be used in the sampling, and the number of hard-coded values is kept to a minimum:
# load data.table
library(data.table)
# this is the original data
a = structure(list(Class = 0:4, value = c("1-8", "9-11", "12-14",
"15-16", "17-20")), row.names = c(NA, -5L), class = c("data.table",
"data.frame"))
# this is to replace "-" by ":", we will use that in a second
a[, value := gsub("\\-", ":", value)]
# this is a vector of EQUAL probabilities
probs = rep(1/a[, uniqueN(Class)], a[, uniqueN(Class)])
# This is a vector of UNEQUAL probabilities. If wanted, it should be
# uncommented and adjusted manually
# probs = c(0.05, 0.1, 0.2, 0.4, 0.25)
# This is the number of Class samples wanted
numberOfSamples = 300
# This is the workhorse
a[sample(.N, numberOfSamples, TRUE, prob = probs)][
  , smpl := apply(.SD, 1, function(x) sample(eval(parse(text = x)), 1)),
    .SDcols = "value"][
  , .(Class, smpl)]
What is good about this code?
If you change your classes or the value ranges, the only thing you need to touch is the original data frame (a, as I called it).
If you want to use unequal probabilities for your sampling, you can set them and the code still runs.
If you want a smaller or larger sample, you don't edit the code; you only change the value of a variable.
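As an optional sanity check (res is just a name introduced here), you can store the result and compare the empirical class shares with probs:

# rerun the sampler, keep the result, and tabulate the class shares
res <- a[sample(.N, numberOfSamples, TRUE, prob = probs)][
  , smpl := apply(.SD, 1, function(x) sample(eval(parse(text = x)), 1)),
    .SDcols = "value"][
  , .(Class, smpl)]
round(prop.table(table(res$Class)), 2)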
Data:
test <- structure(list(fgu_clms = c(14621389.763697, 145818119.352026,
21565415.2337476, 20120830.8221406, 12999772.0950838), loss_to_layer = c(0,
125818119.352026, 1565415.23374765, 120830.822140567, 0)), row.names = c(NA,
5L), class = "data.frame")
> test
fgu_clms loss_to_layer
1 14621390 0.0
2 145818119 125818119.4
3 21565415 1565415.2
4 20120831 120830.8
5 12999772 0.0
I want to create a new column that uses a cumulative sum of the rows above it. It's easier if I show how the calculation works row by row:
Row 1: first calculate the sum of the values in the rows above, in the same (new) column. As this is row 1 there are no rows above, so this sum is 0; call it cumsum_1. Row 1's new value is then the minimum of the row 1 value of "loss_to_layer" and "x2 - cumsum_1".
Row 2: the running sum is now cumsum_2 = cumsum_1 + row 1's new value, i.e. min(x2 - cumsum_1, row 1's loss_to_layer value). Then repeat as above: take the minimum of the row 2 value of the "loss_to_layer" column and x2 - cumsum_2.
And so on.
In excel, this would be done by using MIN(B2,x2 - SUM(C$1:C1)) and dragging this formula down.
The results with x2 = 127,000,000 should be:
fgu_clms loss_to_layer new_col
1 14621390 0.0 0
2 145818119 125818119.4 125818119
3 21565415 1565415.2 1181881
4 20120831 120830.8 0
5 12999772 0.0 0
As you can see, the "new_col" values always sum back up to "x2", in this case 127,000,000.
I have tried:
test <- test %>% mutate(new_col = pmin(loss_to_layer,127e6-cumsum(lag(new_col,1,default=0))))
But I get an error because lag() cannot find the column new_col: it does not exist yet at the time mutate evaluates it.
Because the losses are non-negative, new_col equals loss_to_layer on every row until the layer is exhausted and 0 afterwards, so the running total of loss_to_layer can stand in for the self-referencing running total of new_col:
test %>%
mutate(
cumsum_1 = cumsum(lag(loss_to_layer, default = 0)),
new_col = pmin(loss_to_layer, 127000000 - cumsum_1),
new_col = ifelse(new_col < 0, 0, new_col)
) %>%
select(-cumsum_1)
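If you want a literal translation of the recursive Excel formula instead, here is a minimal base-R sketch with Reduce(), which carries the running total explicitly so no clipping step is needed:

x2 <- 127e6
# running total S_i = S_(i-1) + min(loss_i, x2 - S_(i-1)), starting from 0
S <- Reduce(function(s, loss) s + min(loss, x2 - s),
            test$loss_to_layer, init = 0, accumulate = TRUE)
test$new_col <- diff(S)  # per-row amounts; they sum to at most x2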
My data frame looks like this:
person.  id98  id100  id102  educ98  educ100  educ102  pid98  pid100  pid102
1        3     0      0      2       4        5        T      F       F
2        ...
I hope to transform it like this:
person.  year  id  educ  pid
1        98
1        100
1        102
In Stata, I know that the "reshape" command can automatically identify the year from those variables' names. In R, I don't know how to deal with that.
I want to take the number trailing each column name and bundle the columns based on that number.
If you would like to use reshape, maybe the code below could help
reshape(
setNames(df, gsub("(\\d+)", "\\.\\1", names(df))),
# the gsub needed because `reshape` expects a period as a separator
direction = "long",
varying = -1
)
which gives
      person. time id educ   pid
1.98        1   98  3    2  TRUE
1.100       1  100  0    4 FALSE
1.102       1  102  0    5 FALSE
Data
> dput(df)
structure(list(person. = 1, id98 = 3, id100 = 0, id102 = 0, educ98 = 2,
educ100 = 4, educ102 = 5, pid98 = TRUE, pid100 = FALSE, pid102 = FALSE), class = "data.frame", row.names = c(NA,
-1L))
You can use pivot_longer from tidyr, using the data from @ThomasIsCoding. In names_to, the special sentinel '.value' says that the first captured group of names_pattern (id, educ, pid) should supply the names of the output value columns, while the second captured group becomes the year column:
tidyr::pivot_longer(df,
                    cols = -person.,
                    names_to = c('.value', 'year'),
                    names_pattern = '([a-z]+)(\\d+)')
# person. year id educ pid
# <dbl> <chr> <dbl> <dbl> <lgl>
#1 1 98 3 2 TRUE
#2 1 100 0 4 FALSE
#3 1 102 0 5 FALSE
Using the data.table package this is fairly easy.
As a note, this will only work if the columns of each group appear in the same order, i.e. id90 id100 id102 pid90 pid100 pid102, etc.
Edit
The aforementioned issue has been solved in this new code.
# loading data.table
if(!require(data.table)){
install.packages("data.table")
library(data.table)
}
df= data.frame(person=1:5, id90=rnorm(5), id91=rnorm(5), id92=rnorm(5), pid90=rnorm(5), pid91=rnorm(5), pid92=rnorm(5), educ90=rnorm(5), educ91=rnorm(5), educ92=rnorm(5))
# turn data.frame in data.table
setDT(df)
cols = colnames(df)[order(colnames(df))]
# df[, ..cols] reorders the columns alphabetically
# to evade the problem stated above.
# id.vars is the id vars
# using patterns with measure vars will bundle all the columns
# that match the regex pattern in the same column
dt <- melt(df[, ..cols], id.vars="person", measure.vars=patterns(id="^id", educ="^educ", pid="^pid"))
# getting the years
years = gsub('^id', '', colnames(df)[grepl('^id', colnames(df))])
# changing the years
dt[, c("year","variable"):=list(years[variable], NULL)]
Is it possible to restrict a data frame to a specific row and then change some values in one of the columns?
Let's say I calculate GROWTH as (SIZE_t+1 - SIZE_t)/SIZE_t and now I can see that there are some strange values for GROWTH (e.g. 1000) and the reason is a corrupt value of the corresponding SIZE variable. Now I'd like to find and replace the corrupt value of SIZE.
If I type:
data <- mutate(filter(data, lead(GROWTH)==1000), SIZE = 2600)
then only the corrupt row is stored in data and the rest of my data frame is lost.
What I'd like to do instead is filter "data" on the left hand side to the corresponding row of the corrupt value and then mutate the incorrect variable (on the right hand side):
filter(data, lead(GROWTH)==1000) <- mutate(filter(data, lead(GROWTH)==1000), SIZE = 2600)
but that doesn't seem to work. Is there a way to handle this using dplyr? Many thanks in advance.
You can use an ifelse statement with the mutate function. Say you have a data frame with a corrupted SIZE value at row 3, which leads to a large GROWTH value at row 4, and you want to replace the SIZE at row 3 with some value, 0.3 here (chosen to be consistent with my values rather than yours). The GROWTH > 1000 condition can be adjusted accordingly.
data
SIZE GROWTH
1 -1.49578498 NA
2 -0.38731784 -0.7410605
3 0.00010000 -1.0002582
4 0.53842217 5383.2216758
5 -0.65813674 -2.2223433
6 0.29830698 -1.4532599
7 0.04712019 -0.8420413
8 -0.07312482 -2.5518788
9 1.64310713 -23.4698959
10 1.44927727 -0.1179654
library(dplyr)
data %>% mutate(SIZE = ifelse(lead(GROWTH > 1000, default = F), 0.3, SIZE))
SIZE GROWTH
1 -1.49578498 NA
2 -0.38731784 -0.7410605
3 0.30000000 -1.0002582
4 0.53842217 5383.2216758
5 -0.65813674 -2.2223433
6 0.29830698 -1.4532599
7 0.04712019 -0.8420413
8 -0.07312482 -2.5518788
9 1.64310713 -23.4698959
10 1.44927727 -0.1179654
Data:
structure(list(SIZE = c(-1.49578498093657, -0.387317841955887,
1e-04, 0.538422167582116, -0.658136741561064, 0.298306980856383,
0.0471201873908915, -0.0731248216938637, 1.64310713116132, 1.44927727104653
), GROWTH = c(NA, -0.741060482026387, -1.00025818588551, 5383.22167582116,
-2.22234332311492, -1.45325988053609, -0.842041284935343, -2.55187883883499,
-23.4698958999199, -0.117965442690154)), class = "data.frame", .Names = c("SIZE",
"GROWTH"), row.names = c(NA, -10L))
I am creating correlations using R, with the following code:
Values<-read.csv(inputFile, header = TRUE)
O<-Values$Abundance_O
S<-Values$Abundance_S
cor(O,S)
pear_cor<-round(cor(O,S),4)
outfile<-paste(inputFile, ".jpg", sep = "")
jpeg(filename = outfile, width = 15, height = 10, units = "in", pointsize = 10, quality = 75, bg = "white", res = 300, restoreConsole = TRUE)
rx<-range(0,20000000)
ry<-range(0,200000)
plot(rx,ry, ylab="S", xlab="O", main="O vs S", type="n")
points(O,S, col="black", pch=3, lwd=1)
mtext(sprintf("%s %.4f", "pearson: ", pear_cor), adj=1, padj=0, side = 1, line = 4)
dev.off()
pear_cor
I now need to find the lower quartile of each set of data and exclude the values that fall within it. I would then like to rewrite the data without those values and use the new column in the correlation analysis (i.e. threshold the data at the lower quartile). If there is a way to write this so that the threshold is easy to change by passing arguments from Java (as I do with the input file name), that's even better!
Thank you so much.
I have now implemented the answer below and that is working; however, I need to keep the pairs of data together for the correlation. Here is an example of my data (from csv):
Abundance_O Abundance_S
3635900.752 1390.883073
463299.4622 1470.92626
359101.0482 989.1609251
284966.6421 3248.832403
415283.663 2492.231265
2076456.856 10175.48946
620286.6206 5074.268802
3709754.717 269.6856808
803321.0892 118.2935093
411553.0203 4772.499758
50626.83554 17.29893001
337428.8939 203.3536852
42046.61549 152.1321255
1372013.047 5436.783169
939106.3275 7080.770535
96618.01393 1967.834701
229045.6983 948.3087208
4419414.018 23735.19352
So I need to exclude both values in a row if either one does not meet my quartile threshold (the 0.25 quantile). For example, if the quartile for O were 45000, the row "42046.61549, 152.1321255" would be removed. Is this possible? If I read both columns in as a data frame, can I filter on each column separately? Or find the quartiles and then use those values in code to remove the appropriate rows?
Thanks again, and sorry for the evolution of the question!
Please try to provide a reproducible example, but if you have data in a data.frame, you can subset it using the quantile function as the logical test. For instance, in the following data we want to select only rows from the dataframe where the value of the measured variable 'Val' is above the bottom quartile:
# set.seed so you can reproduce these values exactly on your system
set.seed(39856)
df <- data.frame( ID = 1:10 , Val = runif(10) )
df
ID Val
1 1 0.76487516
2 2 0.59755578
3 3 0.94584374
4 4 0.72179297
5 5 0.04513418
6 6 0.95772248
7 7 0.14566118
8 8 0.84898704
9 9 0.07246594
10 10 0.14136138
# Now to select only rows where the value of our measured variable 'Val' is above the bottom 25% quartile
df[ df$Val > quantile(df$Val , 0.25 ) , ]
ID Val
1 1 0.7648752
2 2 0.5975558
3 3 0.9458437
4 4 0.7217930
6 6 0.9577225
7 7 0.1456612
8 8 0.8489870
# And check the value of the bottom 25% quantile...
quantile(df$Val , 0.25 )
25%
0.1424363
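To keep the pairs together, as the follow-up asks, the same idea extends to a combined condition across both columns. And since the input file already arrives as an argument from Java, the threshold can be passed the same way via commandArgs(). A sketch (the argument positions are an assumption; adjust them to however you invoke Rscript):

args <- commandArgs(trailingOnly = TRUE)
inputFile <- args[1]
# optional second argument: the quantile threshold, defaulting to the lower quartile
threshold <- if (length(args) >= 2) as.numeric(args[2]) else 0.25

Values <- read.csv(inputFile, header = TRUE)
# keep a row only if BOTH columns exceed their own lower quantile
keep <- Values$Abundance_O > quantile(Values$Abundance_O, threshold) &
        Values$Abundance_S > quantile(Values$Abundance_S, threshold)
Values <- Values[keep, ]
cor(Values$Abundance_O, Values$Abundance_S)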
Although this is an old question, I came across it during research of my own and I arrived at a solution that someone may be interested in.
I first defined a function which converts a numerical vector into its quantile groups. The parameter n determines the number of groups (n = 4 for quartiles, n = 10 for deciles).
qgroup = function(numvec, n = 4){
qtile = quantile(numvec, probs = seq(0, 1, 1/n))
out = sapply(numvec, function(x) sum(x >= qtile[-(n+1)]))
return(out)
}
Function example:
v = 1:20
> qgroup(v)
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4
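As an aside, the sapply() loop grows quadratically with the vector length; findInterval() produces the same group labels in a single vectorized call. A sketch of an equivalent version:

qgroup2 = function(numvec, n = 4){
  qtile = quantile(numvec, probs = seq(0, 1, 1/n))
  # counts the breakpoints <= x, identical to the sapply version
  findInterval(numvec, qtile[-(n+1)])
}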
Consider now the following data:
library(data.table)

dt = data.table(
  A0 = runif(100),
  A1 = runif(100)
)
We apply qgroup() across the data to obtain two quartile group columns:
cols = colnames(dt)
qcols = c('Q0', 'Q1')
dt[, (qcols) := lapply(.SD, qgroup), .SDcols = cols]
head(dt)
           A0        A1 Q0 Q1
1: 0.72121846 0.1908863 3 1
2: 0.70373594 0.4389152 3 2
3: 0.04604934 0.5301261 1 3
4: 0.10476643 0.1108709 1 1
5: 0.76907762 0.4913463 4 2
6: 0.38265848 0.9291649 2 4
Lastly, we only include rows for which both quartile groups are above the first quartile:
dt = dt[Q0 > 1 & Q1 > 1]
I am trying to plot the CDF curve for a large dataset containing about 29 million values using ggplot. The way I am computing this is like this:
mycounts = ddply(idata.frame(newdata), .(Type), transform, ecd = ecdf(Value)(Value))
plot = ggplot(mycounts, aes(x=Value, y=ecd))
This is taking ages to plot. I was wondering if there is a clean way to plot only a sample of this dataset (say, every 10th point or 50th point) without compromising on the actual result?
I am not sure about your data structure, but a simple sample call might be enough:
n <- nrow(mycounts) # number of cases in data frame
mycounts <- mycounts[sample(n, round(n/10)), ] # get an n/10 sample to the same data frame
Instead of taking every n-th point, can you quantize your data set down to a sufficient resolution before plotting it? That way, you won't have to plot resolution you don't need (or can't see).
Here's one way you can do it. (The function I've written below is generic, but the example uses names from your question.)
library(ggplot2)
library(plyr)
## A data set containing two ramps up to 100, one by 1, one by 10
tens <- data.frame(Type = factor(c(rep(10, 10), rep(1, 100))),
Value = c(1:10 * 10, 1:100))
## Given a data frame and ddply-style arguments, partition the frame
## using ddply and summarize the values in each partition with a
## quantized ecdf. The resulting data frame for each partition has
## two columns: value and value_ecdf.
dd_ecdf <- function(df, ..., .quantizer = identity, .value = value) {
value_colname <- deparse(substitute(.value))
ddply(df, ..., .fun = function(rdf) {
xs <- rdf[[value_colname]]
qxs <- sort(unique(.quantizer(xs)))
data.frame(value = qxs, value_ecdf = ecdf(xs)(qxs))
})
}
## Plot each type's ECDF (w/o quantization)
tens_cdf <- dd_ecdf(tens, .(Type), .value = Value)
qplot(value, value_ecdf, color = Type, geom = "step", data = tens_cdf)
## Plot each type's ECDF (quantizing to nearest 25)
rounder <- function(...) function(x) round_any(x, ...)
tens_cdfq <- dd_ecdf(tens, .(Type), .value = Value, .quantizer = rounder(25))
qplot(value, value_ecdf, color = Type, geom = "step", data = tens_cdfq)
While the original data set and the ecdf set had 110 rows, the quantized-ecdf set is much reduced:
> dim(tens)
[1] 110 2
> dim(tens_cdf)
[1] 110 3
> dim(tens_cdfq)
[1] 10 3
> tens_cdfq
Type value value_ecdf
1 1 0 0.00
2 1 25 0.25
3 1 50 0.50
4 1 75 0.75
5 1 100 1.00
6 10 0 0.00
7 10 25 0.20
8 10 50 0.50
9 10 75 0.70
10 10 100 1.00
I hope this helps! :-)
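For completeness: current ggplot2 can also compute the ECDF itself with stat_ecdf(), which removes the ddply step entirely (you would still want to quantize or sample first with 29 million values). A minimal sketch on the tens data from above:

library(ggplot2)
# one ECDF step curve per Type, computed by ggplot2 itself
ggplot(tens, aes(Value, colour = Type)) + stat_ecdf(geom = "step")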