I am new to R. I have daily data and want to separate the months whose mean is less than 1 from the rest of the data, then do something with the daily values in months whose mean is greater than 1. The important thing is not to touch the daily values in months whose monthly mean is less than 1.
I have used aggregate(file, as.yearmon, mean) to get the monthly means, but I am failing to grasp how to use them to exclude a specific month's daily values from the analysis. Any suggestion on how to start would be highly appreciated.
I have reproduced the data using a small subset of it and dput:
structure(list(V1 = c(0, 0, 0, 0.43, 0.24, 0, 1.06, 0, 0, 0, 1.57, 1.26, 1.34, 0, 0, 0, 2.09, 0, 0, 0.24)), .Names = "V1", row.names = c(NA, 20L), class = "data.frame")
A snippet of code I am using:
library(zoo)
file <- read.table("text.txt")
x_daily <- zooreg(file, start=as.Date("2000-01-01"))
x1_daily <- x_daily[]
con_daily <- subset(x1_daily, aggregate(x1_daily,as.yearmon,mean) > 1 )
Let's create some sample data:
feb2012 <- data.frame(year=2012, month=2, day=1:28, data=rnorm(28))
feb2013 <- data.frame(year=2013, month=2, day=1:28, data=rnorm(28) + 10)
jul2012 <- data.frame(year=2012, month=7, day=1:31, data=rnorm(31) + 10)
jul2013 <- data.frame(year=2013, month=7, day=1:31, data=rnorm(31) + 10)
d <- rbind(feb2012, feb2013, jul2012, jul2013)
You can get an aggregate of the data column by month like this:
> a <- aggregate(d$data, list(year=d$year, month=d$month), mean)
> a
year month x
1 2012 2 0.09704817
2 2013 2 9.93354271
3 2012 7 10.19073868
4 2013 7 9.78324133
Perhaps not the best way, but an easy way to filter the d data frame by the mean of the corresponding year and month is to work with a temporary data frame that merges d and a, like this:
work <- merge(d, a)
subset(work, x > 1)
I hope this will help you get started!
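If it helps to connect this back to the zoo object in the question, the same filtering idea could look roughly like this; a hedged sketch, assuming x_daily is the zooreg series built above:
library(zoo)
# monthly means of the daily series
monthly_mean <- aggregate(x_daily, as.yearmon, mean)
# months (as yearmon values) whose mean exceeds 1
keep_months <- time(monthly_mean)[coredata(monthly_mean) > 1]
# keep only the daily values that fall in those months
con_daily <- x_daily[as.yearmon(time(x_daily)) %in% keep_months]
The yearmon index makes %in% compare month labels rather than individual days, so daily values in months with a mean of at most 1 are left out of con_daily and stay untouched by any further analysis.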
We frequently ask scale questions in our social surveys; respondents indicate their agreement with a statement (strongly agree, agree, neither agree nor disagree, disagree, strongly disagree). The survey results usually come in an aggregated format, i.e. for each question (variable) the answers are provided in a single column, where 5 = strongly agree, 1 = strongly disagree, etc.
Now we have come across a new survey tool where the answers are partitioned into several columns for one question. For example, column Q1_1 = strongly agree for Q1 and column Q1_5 = strongly disagree. So for each question we receive 5 columns of answers: if a respondent answered strongly agree, that respondent's row in Q1_1 is marked 1, while their rows in Q1_2 to Q1_5 are marked 0.
Can anyone kindly share a solution to 'aggregate' the answers from the new survey tool, so that instead of having 5 columns for each question we have one column per question with values 1-5?
I'm new to R; I thought R could handle this instead of having to change it manually in Excel.
Try this reshaping approach, and next time please follow the advice from @r2evans, as otherwise we have to type in your data. Here is the code:
library(dplyr)
library(tidyr)
#Data
df <- data.frame(Respondent = paste0('Respondent', 1:10),
                 Q6_1 = c(1, 0, 1, 1, 1, 1, 0, 0, 0, 1),
                 Q6_2 = c(0, 1, 0, 0, 0, 0, 1, 1, 0, 1),
                 Q6_3 = rep(0, 10),
                 Q6_4 = c(rep(0, 8), 1, 0),
                 stringsAsFactors = FALSE)
#Code
new <- df %>%
  pivot_longer(-Respondent) %>%
  separate(name, c('variable', 'answer'), sep = '_') %>%
  filter(value == 1) %>%
  select(-value) %>%
  filter(!duplicated(Respondent)) %>%
  pivot_wider(names_from = variable, values_from = answer)
Output:
# A tibble: 10 x 2
Respondent Q6
<chr> <chr>
1 Respondent1 1
2 Respondent2 2
3 Respondent3 1
4 Respondent4 1
5 Respondent5 1
6 Respondent6 1
7 Respondent7 2
8 Respondent8 2
9 Respondent9 4
10 Respondent10 1
I am just curious why Respondent 10 has two values of 1 in your data. Is that a typo, or is that actually possible?
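If ties like Respondent 10's are possible and you simply want the first marked answer, a base R alternative with max.col() might also work. A hedged sketch on the same df as above, not part of the original answer:
# indicator columns for Q6 (assumed naming pattern Q6_1 ... Q6_4)
ans_cols <- grep("^Q6_", names(df))
# max.col() returns, per row, the position of the (first) 1 among those columns,
# which is the 1-5 answer code; this assumes every respondent marked at least one answer
data.frame(Respondent = df$Respondent,
           Q6 = max.col(df[ans_cols], ties.method = "first"))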
We can use data.table methods
library(data.table)
dcast(unique(melt(setDT(df), id.var = 'Respondent')[,
        c('variable', 'answer') := tstrsplit(variable, '_',
          type.convert = TRUE)][value == 1], by = "Respondent"),
      Respondent ~ variable, value.var = 'answer')
-output
# Respondent Q6
# 1: Respondent1 1
# 2: Respondent10 1
# 3: Respondent2 2
# 4: Respondent3 1
# 5: Respondent4 1
# 6: Respondent5 1
# 7: Respondent6 1
# 8: Respondent7 2
# 9: Respondent8 2
#10: Respondent9 4
data
df <- structure(list(Respondent = c("Respondent1", "Respondent2", "Respondent3",
"Respondent4", "Respondent5", "Respondent6", "Respondent7", "Respondent8",
"Respondent9", "Respondent10"), Q6_1 = c(1, 0, 1, 1, 1, 1, 0,
0, 0, 1), Q6_2 = c(0, 1, 0, 0, 0, 0, 1, 1, 0, 1), Q6_3 = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), Q6_4 = c(0, 0, 0, 0, 0, 0, 0, 0,
1, 0)), class = "data.frame", row.names = c(NA, -10L))
I have two datasets. Dataset 1 contains columns with range start, range end, and variable Y. Dataset 2 contains columns with range start, range end, variable A, variable B, and variable C.
I want to compare the two sets of ranges in the two datasets and create a new dataset that has the range overlaps in two columns (i.e. start of overlap, end of overlap) and all the variables for each overlap (i.e. variable Y, variable A, variable B, variable C).
I am very new to R, so I am a bit confused about how to go about this, or even how to explain it properly, but here is an example that I think will make it clear.
Dataset 1:
Start range (96.98, 97.02, 97.06)
End range (97.005, 97.05, 97.095)
Variable Y (1.48, 0.42, 4.78)
Dataset 2:
start range(96.95, 97, 97.05)
end range(97, 97.05, 97.1)
Variable A (100, 50, 10)
Variable B (0, 30, 30)
New Dataset 3:
Start range (96.95, 96.98, 97, 97.005, 97.02, 97.05, 97.06, 97.095)
end range (96.98, 97, 97.005, 97.02, 97.05, 97.06, 97.095, 97.1)
Variable Y (NA, 1.48, 1.48, NA, 0.42, NA, 4.78, NA)
Variable A (100, 100, 50, 50, 50, 10, 10, 10)
Variable B (0, 0, 30, 30, 30, 30, 30, 30)
*Note: NA means no value. In this case, I still want the intervals that don't overlap to be included.
If you just wanted the overlapping ranges, that would be easy: it could be written, for instance, as a SQL join, with sqldf.
library(sqldf)
sqldf("
SELECT MAX(d1.start, d2.start) AS start,
MIN(d1.end, d2.end) AS end,
d1.start AS start1,
d1.end AS end1,
d2.start AS start2,
d2.end AS end2,
d1.Y, d2.A, d2.B, d2.C
FROM d1, d2
WHERE d1.start <= d2.end AND d2.start <= d1.end
")
If you also want the intervals on which there is no overlap, it is trickier;
in particular, a given interval could have several subintervals with no overlap.
One solution is to first compute all those subintervals, by gathering all the endpoints.
# all interval endpoints from both datasets, in increasing order
dates <- sort(unique(c(d1$start, d1$end, d2$start, d2$end)))
# consecutive endpoints define the elementary subintervals
d <- data.frame(
  start = dates[-length(dates)],
  end   = dates[-1]
)
t1 <- sqldf("
SELECT d.start, d.end, d1.Y
FROM d LEFT JOIN d1
ON MAX(d.start, d1.start) < MIN(d.end, d1.end)
")
t2 <- sqldf("
SELECT d.start, d.end, d2.A, d2.B, d2.C
FROM d LEFT JOIN d2
ON MAX(d.start, d2.start) < MIN(d.end, d2.end)
")
sqldf( "SELECT * FROM t1 JOIN t2 USING (start, end)" )
Sample data used:
d1 <- data.frame(
start = c(96.98, 97.02, 97.06),
end = c(97.005, 97.05, 97.095),
Y = c(1.48, 0.42, 4.78)
)
d2 <- data.frame(
start = c(96.95, 97, 97.05),
end = c(97, 97.05, 97.1),
A = c(100,0,0),
B = c(0,0,0),
C = c(0,100,100)
)
I have a dataframe with 3 columns: "ID", "on.tank", "on.mains". The data looks like:
ID: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 (sequential)
on.tank: 25, 0, 10, 0, 43, 0, 5
on.mains: 0, 12, 0, 11, 0, 2, 0
Columns 2 and 3 alternate between zero and a value: when one is zero, the other has a value.
I want to create one column that interleaves the values alternately, and a second column that will be a factor (on.mains, on.tank, on.mains, ...), since the data represent days on tank, then days on mains, then days on tank, and so on.
I tried using melt, but it doesn't interleave; it stacks the data, so I get on.tank, on.tank, on.tank, etc. for 2000 rows and then on.mains, on.mains, etc.
> dput(head(data))
structure(list(ID = 1:6, on.tank = c(0, 56, 0, 1, 0, 97), on.main = c(-1,
0, -9, 0, -18, 0)), .Names = c("ID", "on.tank", "on.main"), row.names = c(NA,
6L), class = "data.frame")
Here's your data:
df <- data.frame(ID = 1:7,
                 on.tank = c(25, 0, 10, 0, 43, 0, 5),
                 on.mains = c(0, 12, 0, 11, 0, 2, 0))
Using base R:
df$On.which <- ifelse(df$on.tank > df$on.mains, "on.tank", "on.mains")
This will work unless any of your values are negative. If you have negative values use:
df$On.which <- ifelse(df$on.mains==0, "on.tank", "on.mains")
Does this do what you need? If you replace the quoted column names with the columns themselves (e.g. df$on.tank instead of "on.tank"), the same ifelse() call also gives you the values of the two columns merged into one.
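Putting the two pieces together, a minimal sketch of both requested columns using the same ifelse() idea (the new column names days and source are just illustrative):
# value column: whichever of the two columns is non-zero on that row
df$days <- ifelse(df$on.mains == 0, df$on.tank, df$on.mains)
# factor column: which source the value came from
df$source <- factor(ifelse(df$on.mains == 0, "on.tank", "on.mains"))
df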
I'm a trying-to-be R user. I never learned to code properly and have just been picking it up by finding things online.
I have encountered a problem for which I need some of you experts' help.
I have two data files.
Particulate matter (PM) concentrations (~20000 observations)
Coefficient combinations to use with the particulate matter concentrations to calculate final concentrations.
For example..
Data set 1.
ID PM
1 5
2 10
... ...
1500 25
Data set 2.
alpha beta
5 6
1 2
... ...
I ultimately have to use all the coefficient combinations (alpha and beta) for each of the IDs from data set 1. For example, if I have 10 observations in data set 1, and 10 coefficient combinations in data set 2, my output table should have 100 different output values (10*10=100).
for (i in cmaq$FID) {
  mean = cmaq$PM * IER$alpha * IER$beta
}
I used the above code to do what I'm trying to do, but it only gave me 10 output values rather than 100. I think using the split function first and then somehow using the result with the second dataset would work, but I have not figured out how...
It may be a very very simple problem, but after spending hours to figure it out, I thought it may be a better strategy to get some help from R experts.
Thank you in advance!!!
You can do:
df1 = data.frame(
  ID = c(1, 2, 1500),
  PM = c(5, 10, 25)
)
df2 = data.frame(
  alpha = c(5, 6),
  beta = c(1, 2)
)
library(tidyverse)
library(dplyr)
df1 %>%
  group_by(ID) %>%
  do(data.frame(result = .$PM * df2$alpha * df2$beta,
                alpha = df2$alpha,
                beta = df2$beta))
Look for the term 'cross join' or 'cartesian join' (e.g., How to do cross join in R?).
If that doesn't address the issue, please see https://stackoverflow.com/help/mcve. I think there is a mistake inside the loop: beta is free-floating and not connected to the IER data.frame.
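For reference, a cross join can be written in base R with merge(..., by = NULL). A hedged sketch using the df1/df2 defined above, not code from the original answers:
# every ID paired with every (alpha, beta) combination: nrow(df1) * nrow(df2) rows
combos <- merge(df1, df2, by = NULL)
combos$result <- combos$PM * combos$alpha * combos$beta
combos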
We can do this with outer
data.frame(ID = rep(df1$ID, each = nrow(df2)),
           alpha = df2$alpha,
           beta = df2$beta,
           result = c(t(outer(df1$PM, df2$alpha * df2$beta))))
# ID alpha beta result
#1 1 5 1 25
#2 1 6 2 60
#3 2 5 1 50
#4 2 6 2 120
#5 1500 5 1 125
#6 1500 6 2 300
data
df1 <- structure(list(ID = c(1, 2, 1500), PM = c(5, 10, 25)), .Names = c("ID",
"PM"), row.names = c(NA, -3L), class = "data.frame")
df2 <- structure(list(alpha = c(5, 6), beta = c(1, 2)), .Names = c("alpha",
"beta"), row.names = c(NA, -2L), class = "data.frame")
This question already has answers here: Stratified random sampling from data frame (6 answers). Closed 6 years ago.
I have a dataframe called test.data with a column called Ethnicity. There are three ethnic groups (more in the actual data): Adygei, Balochi and Biaka_Pygmies. I want to subset this data frame to pick two samples (rows) at random from each ethnic group and get the result shown below. How can I do this in R?
test.data <- structure(list(Sample = c("1793102418_A", "1793102460_A", "1793102500_A",
"1793102576_A", "1749751113_A", "1749751187_A", "1749751189_A",
"1749751285_A", "1749751356_A", "1749751195_A", "1749751218_A",
"1775705355_A"), Ethnicity = c("Adygei", "Adygei", "Adygei",
"Adygei", "Balochi", "Balochi", "Balochi", "Balochi", "Balochi",
"Biaka_Pygmies", "Biaka_Pygmies", "Biaka_Pygmies"), Height = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), .Names = c("Sample", "Ethnicity",
"Height"), row.names = c("1793102418_A", "1793102460_A", "1793102500_A",
"1793102576_A", "1749751113_A", "1749751187_A", "1749751189_A",
"1749751285_A", "1749751356_A", "1749751195_A", "1749751218_A",
"1775705355_A"), class = "data.frame")
Desired result:
Sample Ethnicity Height
1793102418_A 1793102418_A Adygei 0
1793102460_A 1793102460_A Adygei 0
1749751189_A 1749751189_A Balochi 0
1749751285_A 1749751285_A Balochi 0
1749751195_A 1749751195_A Biaka_Pygmies 0
1775705355_A 1775705355_A Biaka_Pygmies 0
We can use data.table. Convert the 'data.frame' to a 'data.table' with setDT(test.data); then, grouped by 'Ethnicity', sample the row indices within each group and subset the rows based on them.
library(data.table)
setDT(test.data)[, .SD[sample(1:.N, 2)], Ethnicity]
Or using tapply from base R
test.data[with(test.data, unlist(tapply(seq_len(nrow(test.data)),
                                        Ethnicity, FUN = sample, 2))), ]
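For completeness, the same stratified sample can also be drawn with dplyr; a sketch, not part of the answers above (slice_sample() needs dplyr >= 1.0.0):
library(dplyr)
test.data %>%
  group_by(Ethnicity) %>%
  slice_sample(n = 2) %>%   # two random rows per ethnic group
  ungroup()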