I computed column means and ranges (min, max) from my data.
df <- matrix(c(3, 5, 2, 3, 6, 3, 4, 4, 4, 5, 4, 3, 5, 5, 5), ncol = 3, byrow = TRUE)
colnames(df) <- paste0("ch", 1:ncol(df))
rownames(df) <- paste0("G", 1:nrow(df))
mean <- colMeans(df, na.rm = FALSE, dims = 1)
range <- apply(df, 2, range)
rownames(range) <- c("min", "max")
res <- rbind(mean, range)
I have a reference mean value (4). Now I want to add an additional row of significance marks (**) to the existing output: mean values less than 4 are considered significant. I managed to produce the significance marks, but I failed to combine them with the existing result.
f <- res[1, ] < 4
test <- factor(f, labels = c("Ns", "**"))
result <- rbind(mean, range, test)
result
ch1 ch2 ch3
mean 4 4.8 3.4
min 3 4.0 2.0
max 5 6.0 5.0
test 1 1.0 2.0
I want the result to look like the following:
ch1 ch2 ch3
mean 4 4.8 3.4
min 3 4.0 2.0
max 5 6.0 5.0
test Ns Ns **
I need your help to solve this.
rbind.data.frame(mean = mean, range, test = as.character(test))
# ch1 ch2 ch3
# mean 4 4.8 3.4
# min 3 4 2
# max 5 6 5
# test Ns Ns **
See ?rbind.data.frame for details.
I think the issue is that a matrix can only store data of a single type. Here, the first three rows are numeric, but test is a factor, so it is coerced to numeric, with Ns and ** mapping to 1 and 2.
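A quick toy example of that coercion (hypothetical, not from the question):
f <- factor(c(FALSE, FALSE, TRUE), labels = c("Ns", "**"))
rbind(matrix(1:3, nrow = 1), f)
#   [,1] [,2] [,3]
#      1    2    3
# f    1    1    2
The factor is silently replaced by its underlying integer codes, which is exactly why test shows up as 1 and 2 above.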
I suggest using a data.frame to do it.
res <- rbind(mean, range)
res <- data.frame(t(res))   # transpose so each channel becomes a row
f <- res$mean < 4           # compare the mean column, not the first row
test <- factor(f, labels = c("Ns", "**"))
res <- cbind(res, test)
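With the sample data above, the result should look roughly like this (note the table is now transposed, so each channel is a row):
res
#     mean min max test
# ch1  4.0   3   5   Ns
# ch2  4.8   4   6   Ns
# ch3  3.4   2   5   **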
I hope this answer helps you!
I'd like to sum rows two by two, in order to study the lag of a certain variable.
Suppose that I have the following data frame:
> df
  SE eggs
1  4  2.0
2  6  4.0
3  7 10.0
4  8  0.5
5  5  1.0
6  1  3.0
7  2  6.0
8  3  9.0
So I expect to obtain the following, where eggs is the sum of the eggs for each pair of SE indexes:
> df
SE2 eggs
"4+5" 3
"6+7" 14
"8+1" 3.5
"2+3" 15
Where
df = data.frame(SE=c(4,6,7,8,5,1,2,3),eggs = c(2,4,10,0.5,1,3,6,9))
Note: the order of the data frame doesn't matter, but I need to start from a certain number (in this case, number 4), then take the next number (in this case, number 5), and keep following this logic: after SE 4+5 comes SE 6+7, then SE 8+1, then SE 2+3...
Any hint on how I can do that?
I think I get the logic. You want ascending numbers starting from 4. When these numbers reach 8 (or whatever the maximum value of SE is), they wrap around back to one and continue to ascend until all the numbers are used up.
You then group these numbers into sequential pairs.
For each pair of numbers, you find the rows of your data frame with the matching values of SE. These rows contain the two values of eggs you wish to sum.
df = data.frame(SE=c(4,6,7,8,5,1,2,3),eggs = c(2,4,10,0.5,1,3,6,9))
first <- 4
i <- match(df$SE, c(first:nrow(df), seq(first - 1)))
groups <- ((seq_along(i) + 1) %/% 2)[i]
do.call(rbind, lapply(split(df, groups), function(x) {
  data.frame(SE = paste(x$SE, collapse = "+"), eggs = sum(x$eggs))
}))
#> SE eggs
#> 1 4+5 3.0
#> 2 6+7 14.0
#> 3 8+1 3.5
#> 4 2+3 15.0
Created on 2020-02-17 by the reprex package (v0.3.0)
Match c(4:8, 1:3) to SE, use the match indexes to index into eggs, reshape into a 2x4 matrix, and sum each column.
k <- 4 # starting index
nr <- nrow(df) # no of rows in df
with(df, colSums(matrix(eggs[match(c(k:nr, seq_len(k-1)), SE)], 2)))
## [1] 3.0 14.0 3.5 15.0
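If the "4+5"-style labels from the question are also wanted, one possible extension of the same idea (a sketch reusing k and nr from above):
idx <- matrix(c(k:nr, seq_len(k - 1)), 2)   # pairs: (4,5), (6,7), (8,1), (2,3)
sums <- with(df, colSums(matrix(eggs[match(c(idx), SE)], 2)))
setNames(sums, apply(idx, 2, paste, collapse = "+"))
##  4+5  6+7  8+1  2+3
##  3.0 14.0  3.5 15.0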
Another option, just a slight variation on my comment where we re-arrange the rows according to the specified logic and then aggregate every two rows:
aggregate(
eggs ~ ceiling(seq_along(SE)/2),
FUN = sum,
data = df[with(df, order(factor(SE, levels = c(seq(SE[1], max(SE)), SE[!SE %in% seq(SE[1], max(SE))])))),]
)[, -1]
[1] 3.0 14.0 3.5 15.0
Or, if you'd like to keep the SE in the specified format:
df <- aggregate(
. ~ ceiling(seq_along(SE)/2),
FUN = paste, collapse = '+',
data = df[with(df, order(factor(SE, levels = c(seq(SE[1], max(SE)), SE[!SE %in% seq(SE[1], max(SE))])))),]
)[, -1]
df$eggs <- sapply(df$eggs, function(x) eval(parse(text = x)))
Output:
df
SE eggs
1 4+5 3.0
2 6+7 14.0
3 8+1 3.5
4 2+3 15.0
I've got numeric values in the range from 1 (min) to 5 (max), recorded across 8 different variables. The first row therefore looks like this:
Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8
4 4 1 4 5 4 4 1
I've computed (row-wise) a median value for each row across the 8 variables. Occasionally, the median will be a midpoint value, for example 4.5 (since there is an even number of variables). The resulting row might therefore look like this:
Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8 Median
1 2 3 4 5 5 5 5 4.5
When I call table() on the median values stored in the Median variable, I get this:
table(df$Median)
1 1.5 2 2.5 3 3.5 4 4.5 5
2 3 10 5 25 17 75 53 87
The issue I am trying to overcome is that I wish to "get rid" of the midpoint/decimal values by folding them into the nearest non-decimal values; however, if I simply use round(), I end up biasing the values (since by definition 4.5 sits exactly in between), like this:
table(round(df$Median))
1 2 3 4 5
2 18 25 145 87
What I was thinking of doing would be to round values based on the proportion of non-decimal numbers in the table (excluding the midpoint values):
So I would first get the proportion of non-decimal numbers using dplyr's filter function:
df %>% filter(Median %% 1 == 0) %>%
  select(Median) %>% table() %>% prop.table()
To get:
1 2 3 4 5
0.01005025 0.05025126 0.12562814 0.37688442 0.43718593
The next step requires constructing a function that takes all midpoint values in the Median variable and rounds them to their nearest non-decimal values while keeping the proportions of the non-decimal values intact, or close to the original ones. For example, the nearest values to 4.5 are 4 and 5, so it would have a chance of becoming 4 based on the proportion 0.37688442 and of becoming 5 based on the proportion 0.43718593. This way I would transform midpoint values to whole numbers without biasing them as much as simply using round().
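A minimal sketch of this idea for the 4.5 case (purely illustrative, hard-coding the two proportions from the output above; a full solution would look the proportions up for each midpoint):
p <- c(0.37688442, 0.43718593)   # observed proportions of 4 and 5
is_45 <- df$Median == 4.5
df$Median[is_45] <- sample(c(4, 5), sum(is_45), replace = TRUE, prob = p)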
An alternative approach is to split the values equally between 4 and 5, so 50% of the observations with a median of 4.5 go to 4 and 50% go to 5.
I am thankful for any suggestions that would help me to solve this problem or get to the point I can start developing the function.
Edit1. Provided my own attempt to answer this question.
Edit2. Provided data.
dput(head(df, 15))
structure(list(uniqueID = c("R_AtXpiwxKPvILFv3", "R_2xwP4iz6UAu1fTj",
"R_b8IXGRKHP58x7GR", "R_ZelynHN8PCxxYyt", "R_PNjIc7h4dHebRgR",
"R_2bTZvYLUuKNC22D", "R_3iLqwuDs493HstB", "R_291dITimLKjYXeL",
"R_YWWGleFLxlIYzrX", "R_3st91vjNWNXlTHt", "R_3Mm8P52gaaxIpwD",
"R_3MxHXTnrncpgnB8", "R_1LqDx1uxReOQHvO", "R_vJEGJDmbqdfO7qF",
"R_3q8Wl8qys6nqxBH"), Median = c(4, 4.5,
1, 4, 5, 4.5, 4, 1.5, 4.5, 4, 3.5, 2, 4.5, 4.5, 3.5)), .Names = c("uniqueID",
"Median"), row.names = c(NA, -15L), class = c("tbl_df",
"tbl", "data.frame"))
I'd implement it like this:
round_randomly = function(x, tolerance = 1e-6) {
  round(x + sample(c(-tolerance, tolerance), size = length(x), replace = TRUE))
}
Calling your sample data dd,
table(round_randomly(dd$Median))
# 1 2 4 5
# 1 2 8 4
Any tolerance value less than 0.5 will work the same if your data contains only integers and midpoints (x.5 values). If you have more continuous data, a smaller tolerance is better (to prevent, say, 4.4 from being jittered up to 4.51 and rounded to 5). I set the default to 1e-6, which seems reasonable: only a value greater than 4.499999 might get rounded up to 5.
Your answer goes to quite a bit of trouble to only add a random value to the midpoints - this isn't necessary because of the rounding. If the original value is 4, then 4.000001 will still round to 4 (even if you set the tolerance to 0.4, 4.4 will still round to 4).
My method makes no guarantees about rounding exactly 50% of midpoints up and 50% down, but each midpoint is rounded up and down with equal probability. Unless you have very little data and an unusually skewed random draw, that should be close enough.
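As a quick sanity check (a hypothetical simulation, not part of the original answer), the up/down split over many midpoints should come out close to 50/50:
set.seed(1)
table(round_randomly(rep(4.5, 10000)))
# expect roughly 5000 fours and 5000 fives; the exact counts vary with the seed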
Following the suggestion from the comments, I've attempted to create a function that randomly adds 0.1 to, or subtracts 0.1 from, all midpoint median values. It's not the most elegant function ever, but it does the job. One issue with the approach might be that the randomization works by randomly sampling half of the midpoint rows and adding 0.1 to them; the remaining, unsampled half automatically gets 0.1 subtracted. It would be more elegant to randomize each value individually, but I would have to explore that option.
The function:
randomize_midpoint <- function(dataset, new_random_median) {
  # Prepare variable for mutate
  new_random_median <- enquo(new_random_median)
  # Get Sample A
  sample_A <- dataset %>%
    filter(Median %% 1 != 0) %>%       # get midpoint values
    sample_frac(0.5, replace = F) %>%  # randomly sample 50% of them
    select(uniqueID, Median)           # anti_join will need some unique identifier
  # Get Sample B by anti_join
  sample_B <- dataset %>%
    filter(Median %% 1 != 0) %>%
    anti_join(sample_A) %>%            # anti_join automatically uses uniqueID
    select(uniqueID, Median)
  # Create opposite of %in%
  "%w/o%" <- Negate("%in%")
  # Mutate median according to conditions in case_when()
  dataset %>% mutate(
    !!quo_name(new_random_median) := case_when(
      uniqueID %in% sample_A$uniqueID ~ round(Median + 0.1),
      uniqueID %in% sample_B$uniqueID ~ round(Median - 0.1),
      uniqueID %w/o% c(sample_A$uniqueID, sample_B$uniqueID) ~ Median
    )
  )
}
The output of the function to compare with previous table():
randomize_midpoint(dataset = df, new_random_median = random_med) %>%
  select(random_med) %>%
  table()
Will return:
Joining, by = c("uniqueID", "Median")
1 2 3 4 5
2 16 36 110 113
Previous table:
table(round(df$Median))
1 2 3 4 5
2 18 25 145 87
I couldn't find anything too similar to this online; it's a seemingly easy data manipulation problem I have been struggling with. I have a vector of distances that looks like this:
distances = c(3, 5, 7, 9, 2.3, 5.2, 1.8, 2.3, 9, 0.75, 14, 11, 4.4, 12, 13)
distances will always be a vector of some length that is a multiple of 5, in my case length(distances) == 15. I'm trying to obtain this:
output = c(2.3, 0.75, 4.4)
Here, 2.3 is the minimum of the first 5 elements, 0.75 is the minimum of elements 6:10, and 4.4 the minimum of elements 11:15. This feels like it lends itself to the apply functions, but I'm not too familiar with them. Any help appreciated!
Here are some possibilities:
1) apply/matrix Form a matrix with 5 rows from distances, stringing the vector out column by column, and then take the minimum of each column:
apply(matrix(distances, 5), 2, min)
## [1] 2.30 0.75 4.40
2) zoo::rollapply An alternative is to use rollapply from the zoo package specifying that we wish to take the minimum of every 5 elements and skipping by 5 to the next set of 5 elements repeatedly.
library(zoo)
rollapply(distances, 5, by = 5, min)
## [1] 2.30 0.75 4.40
3) tapply/gl Since there are length(distances)/5 = 15/5 = 3 groups, each of length 5:
tapply(distances, gl(3, 5), min)
## 1 2 3
## 2.30 0.75 4.40
4) tapply/col This is similar to (3); however, in place of gl it uses col(matrix(...)), borrowing the matrix idea from (1):
tapply(distances, col(matrix(distances, 5)), min)
## 1 2 3
## 2.30 0.75 4.40
I have a large dataset for which I need to generate multiple cross-tables. In particular, these are two-dimensional tables of frequencies along with the mean and SD.
To give an example, I have the data below:
City <- c("A","B","A","A","B","C","D","A","D","C")
Q1 <- c("Agree","Agree","Agree","Agree","Agree","Neither","Neither","Disagree","Agree","Agree")
df <- data.frame(City,Q1)
Keeping this data in mind, I want to generate a cross-table with the mean, as below:
           City
             A    B    C    D
Agree        3    2    1    1
Neither                1    1
Disagree     1
Total        4    2    2    2
Mean       2.5    3  2.5  2.5
When generating the mean, Agree is given a weight of 3, Neither a weight of 2, and Disagree a weight of 1. The cross-table output should have the mean just below the Total row. It would be good to have gridlines between each column and row.
Can you please suggest how to achieve this in R?
Here's a possible solution using addmargins, which allows you to pass predefined functions to your table result:
wm <- function(x) sum(x * c(3, 1, 2)) / sum(x)  # weights follow the table's row order: Agree = 3, Disagree = 1, Neither = 2
addmargins(table(df[2:1]), 1, list(list(Total = sum, Mean = wm)))
# City
# Q1 A B C D
# Agree 3.0 2.0 1.0 1.0
# Disagree 1.0 0.0 0.0 0.0
# Neither 0.0 0.0 1.0 1.0
# Total 4.0 2.0 2.0 2.0
# Mean 2.5 3.0 2.5 2.5
If you want the SD too, you can simply add SD = sd to the function list.
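For example (a sketch of that suggestion, reusing wm and df from above; here sd is simply the unweighted standard deviation of the counts in each column):
addmargins(table(df[2:1]), 1, list(list(Total = sum, Mean = wm, SD = sd)))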
Here's a solution:
x <- table(df$Q1, df$City) #building basic crosstab
#assigning weights to vector
weights <- c("Agree" = 3, "Disagree" = 1, "Neither" = 2)
#getting weighted mean
weightedmean <- apply(x, 2, function(x) {sum(x * weights)/sum(x)})
#building out table
x <- rbind(x,
           apply(x, 2, sum),  # row sums
           weightedmean)
rownames(x)[4:5] <- c("Total", "Mean")
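With the example data, x should then print roughly as:
x
#            A   B   C   D
# Agree    3.0 2.0 1.0 1.0
# Disagree 1.0 0.0 0.0 0.0
# Neither  0.0 0.0 1.0 1.0
# Total    4.0 2.0 2.0 2.0
# Mean     2.5 3.0 2.5 2.5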
I'm wondering if there is an easy way to average over the previous 30 seconds of data in R when there may be more than one data point per second.
For instance, for the sample weight taken at 32 seconds, I want the mean of the concentrations recorded in the past 30 seconds, so the mean of (9, 10, 7, ..., 14, 20, 18, 2). For the sample weight taken at 31 seconds, I want the mean of the concentrations recorded in the past 30 seconds, so the mean of (5, 9, 10, 7, ..., 14, 20, 18). It's technically not a rolling average over the 30 previous measurements, because there can be more than one measurement per second.
I'd like to do this in R.
1) sqldf Using DF below and a window of 3 seconds, join the last three seconds of data to each row of DF and then take the mean over them:
DF <- data.frame(time = c(1, 2, 2, 3, 4, 5, 6, 7, 8, 10), data = 1:10)
library(sqldf)
sqldf("select a.*, avg(b.data) mean
from DF a join DF b on b.time between a.time - 3 and a.time
group by a.rowid")
giving:
time data mean
1 1 1 1.0
2 2 2 2.0
3 2 3 2.0
4 3 4 2.5
5 4 5 3.0
6 5 6 4.0
7 6 7 5.5
8 7 8 6.5
9 8 9 7.5
10 10 10 9.0
The first mean value is mean(1), which is 1; the second and third mean values are mean(1:3), which is 2; the fourth is mean(1:4), which is 2.5; the fifth is mean(1:5), which is 3; the sixth is mean(2:6), which is 4; the seventh is mean(4:7), which is 5.5; and so on.
2) This 2nd solution uses no packages. For each row of DF it finds the rows within 3 seconds back and takes the mean of their data:
Mean3 <- function(i) with(DF, mean(data[time <= time[i] & time >= time[i] - 3]))
cbind(DF, mean = sapply(1:nrow(DF), Mean3))
The rollapply function should do the trick.
library(zoo)
rollapply(weight.vector, 30, mean)
You can do (assuming your data is stored in a dataframe called df):
now <- 32
step <- 30
subsetData <- subset(df, time >= (now-step) & time < now)
average <- mean(subsetData$concentration)
And if you want to calculate the mean at more time points, you can put this in a loop in which you adjust now accordingly; a sketch of that idea follows.
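For example, a minimal sketch (assuming the same df with time and concentration columns as above, and reusing step; time points with no earlier data in the window come out as NaN):
df$rollMean <- sapply(df$time, function(now) {
  mean(df$concentration[df$time >= (now - step) & df$time < now])
})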
My first idea would be to summarise the data so the value column would contain a list of all values.
library(dplyr)
test.data <- data.frame(t = 1:50 + rbinom(50, 30, 0.3), y = rnorm(50)) %>% arrange(t)
prep <- test.data %>% group_by(t) %>% summarise(vals = list(y))
wrk <- left_join(data.frame(t = 1:max(test.data$t)), prep, by = 't')
Unfortunately, zoo's rollapply would not work on such a data.frame.
For testing, I was thinking of using a window of only 5 rows.
I tried commands along the lines of: rollapply(wrk, 5, function(z) mean(unlist(z)))
But maybe someone else can fill in the missing bit of information.
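One way to fill in that missing bit without rollapply (a base-R sketch over the wrk data frame built above, pooling all values in each trailing 5-row window; windows containing no data come out as NA):
win <- 5
wrk$mean5 <- sapply(seq_len(nrow(wrk)), function(i) {
  rows <- max(1, i - win + 1):i
  mean(unlist(wrk$vals[rows]), na.rm = TRUE)
})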
This is sufficiently different that it warrants another answer.
This should do what you're asking with no extra libraries needed.
It just loops through each row, filters based on that row's time, and computes the mean.
Don't fear a simple loop :)
count = 200 # dataset rows
windowTimespan = 30 # timespan of window
# first lets make some data
df = data.frame(
  # 200 random numbers from 0-99
  time = sort(floor(runif(count)*100)),
  concentration = runif(count),
  weight = runif(count)
)
# add placeholder column(s)
df$rollingMeanWeight = NA
df$rollingMeanConcentration = NA
# for each row
for (r in 1:nrow(df)) {
  # get the time in this row
  thisTime = df$time[r]
  # find all the rows within the acceptable timespan
  # note: figure out if you want < vs <=
  thisSubset = df[
    df$time < thisTime &
    df$time >= thisTime - windowTimespan
  , ]
  # get the mean of the subset
  df$rollingMeanWeight[r] = mean(thisSubset$weight)
  df$rollingMeanConcentration[r] = mean(thisSubset$concentration)
}