Subsetting one dataframe from another does not give expected result - r

I have 2 dataframes df1 and df2.
df1 contains 2 columns - t1 and data1, with t1 starting from 0.0001 till 75, with an increment of 0.0001. So it goes like 0.0001, 0.0002, 0.0003..... 74.9999, 75.0000. data1 is just some numbers between 0 and 1.
df2 also contains 2 columns - t2 and data2, but the length of each column is 114 - only selected values between 0.0001 and 75 are present in the time column - eg. 14.6000,15.2451,....73.4568. data2 is again some random numbers with length of 114
I have extracted the values of t2 from another data set
t2<- c(14.6000, 14.6001, 14.6002, 14.6002, 14.6007, 14.6011, 14.6016, 14.602, 14.6037, 14.6055, 14.6072, 14.6089, 14.6151, 14.6214, 14.6277, 14.6339, 14.6402, 14.6545, 14.6688, 14.6831, 14.6974, 14.7117, 14.7261, 14.7573, 14.7886, 14.8199, 14.8511, 14.8824, 14.9137, 14.9681, 15.0225, 15.0768, 15.1312, 15.1856, 15.24, 15.3233, 15.4065, 15.4897, 15.573, 15.6562, 15.7394, 15.8768, 16.0142, 16.1516, 16.289, 16.4264, 16.5638, 16.7676, 16.9715, 17.1753, 17.3792, 17.583, 17.7868, 17.9907, 18.3366, 18.6826, 19.0285, 19.3745, 19.7204, 20.0664, 20.4124, 20.9122, 21.412, 21.9118, 22.4116, 22.9114, 23.4112, 23.911, 24.5965, 25.282, 25.9675, 26.653, 27.3385, 28.024, 29.1158, 30.2075, 31.2993, 32.3911, 33.4828, 34.6828, 35.8828, 37.0828, 38.2828, 39.4828, 40.6828, 41.8828, 43.0828, 44.2828, 45.4828, 46.6828, 47.8828, 49.0828, 50.2828, 51.4828, 52.6828, 53.8828, 55.0828, 56.2828, 57.4828, 58.6828, 59.8828, 61.0828, 62.2828, 63.4828, 64.6828, 65.8828, 67.0828, 68.2828, 69.4828, 70.6828, 71.8828, 73.0828, 74.2828,74.6000)
df1<- data.frame("t1"=seq(0.0001,75,0.0001), "data1"=c(rnorm(750000)))
df2<- data.frame("t2"=t2, "data2"=c(rnorm(length(t2))))
I want to create a new dataframe - df_new , in which I want to pick the values of t2 and the corresponding data1 values from df1
df_new<- subset(df1,t1 %in% df2$t2)
When I do this, df_new has only 74 observations, instead of 114. Am I doing something wrong here?

This looks like a floating-point problem; see the two examples below. Directly comparing floating-point numbers like this is not robust, because most decimal fractions cannot be represented exactly in binary. I picked the first element of df2$t2 that doesn't line up as expected. You would hope that the == comparison returns TRUE, but it doesn't. all.equal, which tests "near equality" (equality up to a small tolerance), does return TRUE for the two values I pulled out. You can see the difference by increasing the number of digits printed with options.
One way to get the intended result is to use round to make all the numbers you want the same. Note that there are only 113 rows in your output because there are only 113 unique values in df2$t2 as provided. You might also consider converting to integers (with correspondingly smaller units).
t2<- c(14.6000, 14.6001, 14.6002, 14.6002, 14.6007, 14.6011, 14.6016, 14.602, 14.6037, 14.6055, 14.6072, 14.6089, 14.6151, 14.6214, 14.6277, 14.6339, 14.6402, 14.6545, 14.6688, 14.6831, 14.6974, 14.7117, 14.7261, 14.7573, 14.7886, 14.8199, 14.8511, 14.8824, 14.9137, 14.9681, 15.0225, 15.0768, 15.1312, 15.1856, 15.24, 15.3233, 15.4065, 15.4897, 15.573, 15.6562, 15.7394, 15.8768, 16.0142, 16.1516, 16.289, 16.4264, 16.5638, 16.7676, 16.9715, 17.1753, 17.3792, 17.583, 17.7868, 17.9907, 18.3366, 18.6826, 19.0285, 19.3745, 19.7204, 20.0664, 20.4124, 20.9122, 21.412, 21.9118, 22.4116, 22.9114, 23.4112, 23.911, 24.5965, 25.282, 25.9675, 26.653, 27.3385, 28.024, 29.1158, 30.2075, 31.2993, 32.3911, 33.4828, 34.6828, 35.8828, 37.0828, 38.2828, 39.4828, 40.6828, 41.8828, 43.0828, 44.2828, 45.4828, 46.6828, 47.8828, 49.0828, 50.2828, 51.4828, 52.6828, 53.8828, 55.0828, 56.2828, 57.4828, 58.6828, 59.8828, 61.0828, 62.2828, 63.4828, 64.6828, 65.8828, 67.0828, 68.2828, 69.4828, 70.6828, 71.8828, 73.0828, 74.2828,74.6000)
set.seed(12345)
df1<- data.frame("t1"=seq(0.0001,75,0.0001), "data1"=c(rnorm(750000)))
df2<- data.frame("t2"= t2, "data2"=c(rnorm(length(t2))))
df2$t2[2]
#> [1] 14.6001
df1$t1[146001]
#> [1] 14.6001
df1$t1[146001] == df2$t2[2]
#> [1] FALSE
all.equal(df1$t1[146001], df2$t2[2])
#> [1] TRUE
options(digits = 22)
df2$t2[2]
#> [1] 14.600099999999999
df1$t1[146001]
#> [1] 14.600100000000001
df_new_rnd <- subset(df1, round(t1, 4) %in% round(df2$t2, 4))
df_new_int <- subset(df1, as.integer(t1 * 10000) %in% as.integer(df2$t2 * 10000))
nrow(df_new_rnd)
#> [1] 113
nrow(df_new_int)
#> [1] 113
Created on 2018-05-22 by the reprex package (v0.2.0).
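If the goal is one row per value of t2 (so 114 rows, duplicates included), a minimal sketch along the same lines is to build integer keys and look them up with match(). The small df1 and df2 below are hypothetical stand-ins, not the data from the question, and round() is applied before as.integer() because as.integer() truncates toward zero:

```r
# Hypothetical stand-ins for df1 and df2 (same 0.0001 grid, much shorter)
t1  <- seq(0.0001, 1, by = 0.0001)
df1 <- data.frame(t1 = t1, data1 = seq_along(t1))
df2 <- data.frame(t2 = c(0.3, 0.3001, 0.3001, 0.7))

# Integer keys in units of 0.0001; round first, then convert
key1 <- as.integer(round(df1$t1 * 10000))
key2 <- as.integer(round(df2$t2 * 10000))

# match() returns, for each t2, the row of df1 with the same key
idx    <- match(key2, key1)
df_new <- data.frame(t2 = df2$t2, data1 = df1$data1[idx])
df_new
```

Unlike subset() with %in%, this keeps the order of df2 and repeats a row when t2 contains the same value twice.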

Related

Randomly select strings based on multiple criteria in R

I'm trying to select strings based on multiple criteria but so far no success.
My vector contains the following strings (a total of 48 strings): (1_A, 1_B, 1_C, 1_D, 2_A, 2_B, 2_C, 2_D... 12_A, 12_B, 12_C, 12_D)
I need to randomly select 12 strings. The criteria are:
I need one string containing each number
I need exactly three strings that contain each letter.
I need the final output to be something like: 1_A, 2_A, 3_A, 4_B, 5_B, 6_B, 7_C, 8_C, 9_C, 10_D, 11_D, 12_D.
Any help will be appreciated.
All the best,
Angelica
The trick here is not to use your vector at all, but to create the sample strings from their components, which are randomly chosen according to your criteria.
sample(paste(sample(12), rep(LETTERS[1:4], 3), sep = '_'))
#> [1] "12_D" "8_C" "7_B" "1_B" "6_D" "5_A" "4_B" "10_A" "2_C" "3_A" "11_D" "9_C"
This will give a different result each time.
Note that all 4 letters are always represented exactly 3 times since we use rep(LETTERS[1:4], 3), all numbers 1 to 12 are present exactly once but in a random order since we use sample(12), and the final result is shuffled so that the order of the letters and the order of the numbers is not predictable.
If you want the result to give you the indices of your original vector where the samples are from, then it's easy to do that using match. We can recreate your vector by doing:
vec <- paste(rep(1:12, each = 4), rep(LETTERS[1:4], 12), sep = "_")
vec
#> [1] "1_A" "1_B" "1_C" "1_D" "2_A" "2_B" "2_C" "2_D" "3_A" "3_B"
#> [11] "3_C" "3_D" "4_A" "4_B" "4_C" "4_D" "5_A" "5_B" "5_C" "5_D"
#> [21] "6_A" "6_B" "6_C" "6_D" "7_A" "7_B" "7_C" "7_D" "8_A" "8_B"
#> [31] "8_C" "8_D" "9_A" "9_B" "9_C" "9_D" "10_A" "10_B" "10_C" "10_D"
#> [41] "11_A" "11_B" "11_C" "11_D" "12_A" "12_B" "12_C" "12_D"
And to find the location of the random samples we can do:
samp <- match(sample(paste(sample(12), rep(LETTERS[1:4], 3), sep = '_')), vec)
samp
#> [1] 30 26 37 43 46 20 8 3 33 24 15 9
So that, for example, you can retrieve an appropriate sample from your vector with:
vec[samp]
#> [1] "8_B" "7_B" "10_A" "11_C" "12_B" "5_D" "2_D" "1_C" "9_A" "6_D"
#> [11] "4_C" "3_A"
Created on 2022-04-10 by the reprex package (v2.0.1)
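As a quick sanity check, the two criteria can be verified by splitting a sample back into its number and letter parts. This is only a sketch; the seed and the names picked, nums and lets are made up for illustration:

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

picked <- sample(paste(sample(12), rep(LETTERS[1:4], 3), sep = "_"))

# Split "10_A" into c("10", "A") and pull the two parts apart
parts <- strsplit(picked, "_", fixed = TRUE)
nums  <- as.integer(sapply(parts, `[`, 1))
lets  <- sapply(parts, `[`, 2)

# Each number 1..12 appears once; each letter appears exactly three times
stopifnot(setequal(nums, 1:12), !anyDuplicated(nums))
stopifnot(all(table(lets) == 3))
```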

How to loop a conditional execution across a vector (multiple ifelse scenarios) and return multiple vectors for each scenario

I have a mock dataset where I want to evaluate how break points (in longitude) influence an outcome (a designation of east vs. west).
For example, the following line of code would amend 1 column to the dataframe (labeled region) filled with "East" or "West" depending on if the value in the Longitude column is greater or less than -97.
wnv$region <- ifelse(wnv$Longitude>-97, "East", "West")
Eventually, I want to see how different thresholds (not -97) would affect another variable in the dataset. Thus, I want to loop across a vector of values-- say breakpoints <- seq(-171, -70, 5) -- and get a new vector (column in the dataset) for each value in the breakpoint sequence.
How might you do this in a loop rather than writing a new ifelse statement for each breakpoint?
Thanks
I've included different region labels for each breakpoint, in case that is desired. If not, just remove regions and replace regions[[k]][1] and regions[[k]][2] with your desired values.
breakpoints <- c(-171, 70, 5)
col_names <- paste("gt_breakpoint", seq_along(breakpoints), sep = "_")
wnv <- data.frame(longitude = c(50, -99, 143, 90, 2, -8))
regions <- list(c("region1A", "region1B"), c("region2A", "region2B"),
c("region3A", "region3B"))
for (k in (seq_along(breakpoints))) {
wnv[[col_names[k]]] <- ifelse(wnv$longitude > breakpoints[k], regions[[k]][1],
regions[[k]][2])
}
wnv
#> longitude gt_breakpoint_1 gt_breakpoint_2 gt_breakpoint_3
#> 1 50 region1A region2B region3A
#> 2 -99 region1A region2B region3B
#> 3 143 region1A region2A region3A
#> 4 90 region1A region2A region3A
#> 5 2 region1A region2B region3B
#> 6 -8 region1A region2B region3B
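For the exact case in the question (fixed "East"/"West" labels and breakpoints from seq(-171, -70, 5)), the same loop could look roughly like this. The longitudes and the region_<value>W column naming are assumptions made for the sketch:

```r
# Hypothetical longitudes standing in for the real wnv data
wnv <- data.frame(Longitude = c(-150, -120, -99, -97, -80, -71))

breakpoints <- seq(-171, -70, 5)  # -171, -166, ..., -71

for (bp in breakpoints) {
  # Name each column after its breakpoint, e.g. "region_96W" for -96
  col <- paste0("region_", abs(bp), "W")
  wnv[[col]] <- ifelse(wnv$Longitude > bp, "East", "West")
}

ncol(wnv)  # one Longitude column plus one column per breakpoint
```

Naming the columns after the breakpoint value rather than its position makes it easier to tell later which threshold produced which column.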

Double entries in dataframe after merge - r

My data
Hello, I have a problem with merging two dataframes with each other.
The goal is to merge them so that each date has the corresponding values. If there is no corresponding value, I want to replace NA with 0.
names(FiresNearLA.ab.03)[1] <- "Date.Local"
U.NO2.ab.03 <- unique(NO2.ab.03) # No2.ab.03 has all values multiplied
ind <- merge(FiresNearLA.ab.03,U.NO2.ab.03, all = TRUE, all.x=TRUE)
ind[is.na(ind)] <- 0
So far so good, and the first lines look the way they are supposed to. But beginning from 2004-04-24, all dates are doubled and it writes weird values in the second NO2.Mean column.
U.NO2.Mean table:
Date.Local NO2.Mean
361 2004-03-31 30.217391
365 2004-04-24 50.000000
366 2004-04-25 47.304348
370 2004-04-26 50.913043
374 2004-04-27 41.157895
ind table:
Date.Local FIRE_SIZE F.number.n_fires NO2.Mean
113 2004-04-22 34.30 10 13.681818
114 2004-04-23 45.00 13 17.222222
115 2004-04-24 55.40 22 28.818182
116 2004-04-24 55.40 22 50.000000
117 2004-04-25 2306.85 15 47.304348
118 2004-04-25 2306.85 15 21.090909
Why are there values in NO2.Mean for 2004-04-23 and 2004-04-22 if they should be 0? Why does it double the rows after the 24th, and where do the second values come from?
Thank you
So I managed to merge your data:
FiresNearLA.ab.03 <- dget("FiresNearLA.ab.03.txt", keep.source = FALSE)
U.NO2.ab.03 <- dget("NO2.ab.03.txt", keep.source = FALSE)
ind <- merge(FiresNearLA.ab.03,
U.NO2.ab.03,
all = TRUE,
by.x = "DISCOVERY_DATEymd",
by.y = "Date.Local")
As a side note: Usually, you share a small sample of your data on stackoverflow, not the whole thing. In your case, dput(FiresNearLA.ab.03[1:50, ]) and then copy and paste from the console to the question would have been sufficient.
Back to your problem: the duplication already happens in NO2.ab.03, where a number of date and value combinations occur twice or more. The easiest way to solve this (in my experience) is the data.table package, whose duplicated method is more straightforward (it takes a by argument) and also faster:
library(data.table)
# Test duplicated occurrences in U.NO2.ab.03
> table(duplicated(U.NO2.ab.03, by = c("DISCOVERY_DATEymd", "NO2.Mean")))
FALSE TRUE
7767 27308
>
> nrow(ind)
[1] 35229
# Remove duplicated rows from data frame
> ind <- ind[!duplicated(ind, by = c("DISCOVERY_DATEymd", "NO2.Mean")), ]
> nrow(ind)
[1] 7921
After these steps, you should be fine :)
I got the answer. The original data source of NO2.ab.03 was faulty.
As JonGrub suggested, the problem was within NO2.ab.03. For some days it had two different NO2.Mean values for the same date. I deleted these rows and now it works. Thank you again for the help and the great advice.
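If deleting rows is not an option, one alternative is to collapse the NO2 table to a single value per date before merging. The sketch below uses made-up data and averages the conflicting values; taking the mean is an assumption, and whether it is appropriate depends on why the source has two values for one day:

```r
# Hypothetical miniature versions of the two tables
fires <- data.frame(
  Date.Local = as.Date(c("2004-04-23", "2004-04-24", "2004-04-25")),
  FIRE_SIZE  = c(45, 55.4, 2306.85)
)
no2 <- data.frame(
  Date.Local = as.Date(c("2004-04-24", "2004-04-24", "2004-04-25")),
  NO2.Mean   = c(28.818182, 50, 47.304348)
)

# One row per date: average the NO2 values that share a date
no2_daily <- aggregate(NO2.Mean ~ Date.Local, data = no2, FUN = mean)

# Keep every fire date; dates without NO2 data become NA, then 0
ind <- merge(fires, no2_daily, by = "Date.Local", all.x = TRUE)
ind$NO2.Mean[is.na(ind$NO2.Mean)] <- 0
ind
```

Because the right-hand table now has at most one row per date, merge() can no longer produce doubled dates.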

New variable: sum of numbers from a list powered by value of different columns

This is my first question in Stackoverflow. I am not new to R, although I sometimes struggle with things that might be considered basic.
I want to calculate the count median diameter (CMD) for each of my rows from a Particle Size Distribution dataset.
My data looks like this (several rows and 53 columns in total):
date CPC n3.16 n3.55 n3.98 n4.47 n5.01 n5.62 n6.31 n7.08 n7.94
2015-01-01 00:00:00 5263.434 72.988 140.346 138.801 172.473 344.806 484.415 606.430 739.625 927.082
2015-01-01 01:00:00 4813.182 152.823 80.861 140.017 213.382 264.496 359.455 487.293 840.349 1069.846
Each variable starting with "n" indicates the number of particles for the corresponding size (variable n3.16 = number of particles with a median size of 3.16 nm). I will divide the values by 100 before the calculations, to avoid numbers so large that they prevent the computation.
To compute the CMD, I need to do the following calculation:
CMD = (D1^n1*D2^n2...Di^ni)^(1/N)
where Di is the diameter (to be extracted from the column name), ni is the number of particles for diameter Di, and N is the total sum of particles (sum of all the columns starting with "n").
To get the Di, I created a numeric list from the column names that start with n:
D <- as.numeric(gsub("n", "", names(data)[3:54]))
This is my attempt to create a new variable with the calculation of CMD, although it doesn't work.
data$cmd <- for i in 1:ncol(D) {
prod(D[[i]]^data[,i+2])
}
I also tried to use apply, but I again, it didn't work
data$cmd <- for i in 1:ncol(size) {
apply(data,1, function(x) prod(size[[i]]^data[,i+2])
}
I have different datasets from different sites which have different number of columns, so I would like to make code "universal".
Thank you very much
This should work (I had to mutilate your date variable because of read.table, but it is not involved in the calculations, so just ignore that):
> df
date CPC n3.16 n3.55 n3.98 n4.47 n5.01 n5.62 n6.31 n7.08 n7.94
1 2015-01-01 5263.434 72.988 140.346 138.801 172.473 344.806 484.415 606.430 739.625 927.082
2 2015-01-01 4813.182 152.823 80.861 140.017 213.382 264.496 359.455 487.293 840.349 1069.846
N <- sum(df[3:11]) # did you mean the sum of all n.columns over all rows? if not, you'd need to edit this
> N
[1] 7235.488
D <- as.numeric(gsub("n", "", names(df)[3:11]))
> D
[1] 3.16 3.55 3.98 4.47 5.01 5.62 6.31 7.08 7.94
new <- t(apply(df[3:11], 1, function(x, y) (x^y), y = D))
> new
n3.16 n3.55 n3.98 n4.47 n5.01 n5.62 n6.31 n7.08 n7.94
[1,] 772457.6 41933406 336296640 9957341349 5.167135e+12 1.232886e+15 3.625318e+17 2.054007e+20 3.621747e+23
[2,] 7980615.0 5922074 348176502 25783108893 1.368736e+12 2.305272e+14 9.119184e+16 5.071946e+20 1.129304e+24
df$CMD <- rowSums(new)^(1/N)
> df
date CPC n3.16 n3.55 n3.98 n4.47 n5.01 n5.62 n6.31 n7.08 n7.94 CMD
1 2015-01-01 5263.434 72.988 140.346 138.801 172.473 344.806 484.415 606.430 739.625 927.082 1.007526
2 2015-01-01 4813.182 152.823 80.861 140.017 213.382 264.496 359.455 487.293 840.349 1069.846 1.007684
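Note that the code above raises the counts to the power of the diameters and then sums them, whereas the formula in the question is a product of D^n with N taken per row, i.e. a count-weighted geometric mean of the diameters. A minimal sketch of that reading, computed in logs, could look like this (the three-column df and the "^n[0-9]" column pattern are assumptions for illustration):

```r
# Hypothetical cut-down version of the data (first three size bins only)
df <- data.frame(
  n3.16 = c(72.988, 152.823),
  n3.55 = c(140.346, 80.861),
  n3.98 = c(138.801, 140.017)
)

# Find the size-bin columns by name, so the column count can vary by site
n_cols <- grep("^n[0-9]", names(df), value = TRUE)
D      <- as.numeric(sub("^n", "", n_cols))  # diameters from the names
counts <- as.matrix(df[n_cols])              # one row per observation
N      <- unname(rowSums(counts))            # total particles per row

# (prod(D^n))^(1/N) is the same as exp(sum(n * log(D)) / N)
df$CMD <- exp(as.vector(counts %*% log(D)) / N)
df
```

Working in logs should also avoid the overflow mentioned in the question, and rescaling all counts by a constant (such as dividing by 100) does not change a weighted geometric mean, so that step may not be needed.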

Use first digit as factor to standardize values in R

I have a large data frame, tocalculate, from a survey (original data frame brfss2013), where one of the variables represents the number of times a person checks their blood glucose levels. The data is in 3 digits:
First digit tells you if the measurements are per day (1), per week (2), per month (3) or per year (4). The second and third digits represent the actual value.
Example: 101 is once ( _01) per day (1 _ _), 202 is twice per week, etc.
I want to standardize everything to get value of times per year. So I will multiply the 2nd and 3rd digits by 365, 52.143, 12 and 1 (days, weeks, months, year).
I think I would be able to "select" the digits to use, but I'm not sure how to write something that can work with different rows with different set of instructions.
EDIT:
Adding my attempt and sample data.
tocalculate <- brfss2013 %>%
filter(nchar(bldsugar) > 2)
bldsugar2 <- sapply(tocalculate$bldsugar, function(x) {
if (substr(x,1,1) == 1) {x*365}
if (substr(x,1,1) == 2) {x*52}
if (substr(x,1,1) == 3) {x*12}
if (substr(x,1,1) == 4) {x*365}
})
I'm getting a lot of NULL values though...
Since you're already using dplyr, recode is a handy function. I use %/% to see how many times 100 goes into each bldsugar value and %% to get the remainder when divided by 100.
# sample data
brfss_sample = data.frame(bldsugar = c(101, 102, 201, 202, 301, 302, 401, 402))
library(dplyr)
mutate(
brfss_sample,
mult = recode(
bldsugar %/% 100,
`1` = 365.25,
`2` = 52.143,
`3` = 12,
`4` = 1
),
checks_per_year = bldsugar %% 100 * mult
)
# bldsugar mult checks_per_year
# 1 101 365.250 365.250
# 2 102 365.250 730.500
# 3 201 52.143 52.143
# 4 202 52.143 104.286
# 5 301 12.000 12.000
# 6 302 12.000 24.000
# 7 401 1.000 1.000
# 8 402 1.000 2.000
You could, of course, remove the mult column (or combine the definitions so it is never created in the first place).
#Data
set.seed(42)
x = sample(101:499, 100, replace = TRUE)
#1st digit
as.factor(floor((x/100)))
#Values
((x/100) %% 1) * 100
Perhaps the first thing you can do is to split the 3-digit variable into two variables. The first variable is only one digit, which shows sampling frequency; and the second variable shows times of measurement.
In R, substr or substring can select the string by specifying the first and last position to subset.
# Create example data frame
ex_data <- data.frame(var = c("101", "202", "204"))
# Split the variable to create two new columns
ex_data$var1 <- substring(ex_data$var, first = 1, last = 1)
ex_data$var2 <- substring(ex_data$var, first = 2, last = 3)
# Remove the original variable
ex_data$var <- NULL
After this, you can manipulate your data frame. Perhaps convert var1 to factor and var2 to numeric for further manipulation and analysis.
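As for the NULL values in the original attempt: a function returns the value of its last expression, and an if whose condition is FALSE (with no else) returns NULL, so every value whose first digit is not 4 ends up as NULL. A base R sketch that avoids the if chain by indexing a lookup vector is shown below; the multipliers are the ones listed in the question and the variable names are made up:

```r
# Sample codes: first digit = unit, last two digits = count
bldsugar <- c(101, 102, 201, 202, 301, 302, 401, 402)

# Multipliers per unit code 1..4 (day, week, month, year), as in the question
per_year <- c(365, 52.143, 12, 1)

unit  <- bldsugar %/% 100  # first digit
count <- bldsugar %% 100   # second and third digits

# Index the lookup vector by the unit code, one multiplier per row
checks_per_year <- count * per_year[unit]
checks_per_year
```

Everything here is vectorised, so no sapply or row-by-row branching is needed.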
