calculating standard deviation in R on a dataframe

I have a dataframe in R.
index seq change change1 change2 change3 change4 change5 change6
1 1 0.12 0.34 1.2 1.7 4.5 2.5 3.4
2 2 1.12 2.54 1.1 0.56 0.87 2.6 3.2
3 3 1.86 3.23 1.6 0.23 3.4 0.75 11.2
... ... ... ... ... ... ... ... ...
The name of the dataframe is just FullData. I can access each column of FullData using the code:
FullData[2] for 'change'
FullData[3] for 'change1'
FullData[4] for 'change2'
...
...
Now, I wish to calculate the standard deviation of the values in the first row of the first four 'change' columns, and so on across all the columns:
standarddeviation = sd(c(0.12, 0.34, 1.2, 1.7))
then
standarddeviation = sd(c(0.34, 1.2, 1.7, 4.5))
This has to be done for all rows, so basically I want to calculate the sd row-wise while the data is stored column-wise. Is it possible to do this?
How can I access a row of the data frame using a for loop over the index or seq variable?
How can I do this in R? Is there a better way?

I guess you're looking for something like this.
# work on the numeric values as a matrix so that sd() accepts the subsets
m <- as.matrix(FullData)
st.dev <- numeric()
for (i in 1:nrow(m)) {
  for (j in 1:ncol(m)) {
    # sd of everything from row i, column j down to the bottom-right corner
    st.dev <- cbind(st.dev, sd(m[i:nrow(m), j:ncol(m)]))
  }
}
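If the goal is a row-wise standard deviation over a sliding window of four adjacent columns, here is a sketch of a loop-free alternative (it assumes the columns change through change6 sit in positions 3 to 9 of FullData; adjust the indices to your actual layout):
# Assumption: change ... change6 are columns 3 to 9 of FullData
vals <- as.matrix(FullData[, 3:9])
window <- 4
n_win <- ncol(vals) - window + 1
# roll_sd[i, k] is sd(vals[i, k:(k + 3)]), i.e. the sd of four adjacent
# 'change' values in row i, starting at column k
roll_sd <- sapply(seq_len(n_win), function(k) {
  apply(vals[, k:(k + window - 1), drop = FALSE], 1, sd)
})
The first column of roll_sd then holds sd(c(0.12, 0.34, 1.2, 1.7)) for row 1, the second column sd(c(0.34, 1.2, 1.7, 4.5)), and so on.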

Related

Comparing two data frames with the same column and row names

I have two data frames with the same column and row names but have different values.
>data_tls
plot_id max min mean std vol
mf20 20.04 2.23 8.4 3.45 201
mf21 25.24 3.4 4.3 5.5 304
mf22 28.34 5.3 6.2 2.45 240
mf23 30.4 2.05 10.4 6.06 403
>data_uls
plot_id max min mean std vol
mf20 19.09 4.22 6.2 4.45 220
mf21 20.2 2.6 5.3 4.5 305
mf22 32.3 4.3 2.2 3.45 255
mf23 28.4 3.05 8.05 5.85 386
I want to compare the values in these datasets and select the values that differ by more than 20%. I am trying to use the compareDF package, following the example here: https://www.r-bloggers.com/comparing-dataframes-in-r-using-comparedf/.
compareData <- compare_df(data_tls, data_uls, c("Plot_name"))
compareData$comparison_df
However, print(compareData$html_output) returns NULL.
I would really appreciate it if someone could kindly help solve this or recommend another solution.
To get a TRUE/FALSE (logical) matrix, use
res <- data_tls > data_uls * 1.2 | data_tls < data_uls * 0.8
Note: the data frames may contain only numeric columns, so you have to remove e.g. the plot_id column first (or select only the numeric columns in the above expression)!
You can then count the differences per row or per column with
rowSums(res)
colSums(res)
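Putting it together, a minimal sketch (assuming plot_id is an ordinary first column and all remaining columns are numeric):
# drop the non-numeric plot_id column before comparing
num_tls <- data_tls[, -1]
num_uls <- data_uls[, -1]
# TRUE where the tls value is more than 20% above or below the uls value
res <- num_tls > num_uls * 1.2 | num_tls < num_uls * 0.8
# number of values per plot (row) that differ by more than 20%
rowSums(res)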

What is the best way to generate a random dataset from an existing dataset?

Are there any packages in R that can generate a random dataset given a pre-existing template dataset?
For example, let's say I have the iris dataset:
data(iris)
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
I want some function random_df(iris) which will generate a data frame with the same columns as iris but with random data (preferably random data that preserves certain statistical properties of the original, e.g., the mean and standard deviation of the numeric variables).
What is the easiest way to do this?
[Comment from question author moved here. --Editor's note]
I don't want to sample random rows from an existing dataset. I want to generate actually random data with all the same columns (and types) as an existing dataset. Ideally, if there is some way to preserve the statistical properties of the data for numeric variables, that would be preferable, but it's not required.
How about this for a start:
Define a function that simulates data from df by
drawing samples from a normal distribution for numeric columns in df, with the same mean and sd as in the original data column, and
uniformly drawing samples from the levels of factor columns.
generate_data <- function(df, nrow = 10) {
  as.data.frame(lapply(df, function(x) {
    if (class(x) == "numeric") {
      rnorm(nrow, mean = mean(x), sd = sd(x))
    } else if (class(x) == "factor") {
      sample(levels(x), nrow, replace = TRUE)
    }
  }))
}
Then for example, if we take iris, we get
set.seed(2019)
df <- generate_data(iris)
str(df)
#'data.frame': 10 obs. of 5 variables:
# $ Sepal.Length: num 6.45 5.42 4.49 6.6 4.79 ...
# $ Sepal.Width : num 2.95 3.76 2.57 3.16 3.2 ...
# $ Petal.Length: num 4.26 5.47 5.29 6.19 2.33 ...
# $ Petal.Width : num 0.487 1.68 1.779 0.809 1.963 ...
# $ Species : Factor w/ 3 levels "setosa","versicolor",..: 3 2 1 2 3 2 1 1 2 3
It should be fairly straightforward to extend the generate_data function to account for other column types.
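For instance, one possible extension is sketched below; how integer, character, and logical columns should be randomised is an assumption on my part, so adapt the branches to your needs:
generate_data2 <- function(df, nrow = 10) {
  as.data.frame(lapply(df, function(x) {
    if (is.integer(x)) {
      # assumption: round normal draws for integer columns
      as.integer(round(rnorm(nrow, mean = mean(x), sd = sd(x))))
    } else if (is.numeric(x)) {
      rnorm(nrow, mean = mean(x), sd = sd(x))
    } else if (is.factor(x)) {
      factor(sample(levels(x), nrow, replace = TRUE), levels = levels(x))
    } else if (is.character(x)) {
      sample(unique(x), nrow, replace = TRUE)
    } else if (is.logical(x)) {
      sample(c(TRUE, FALSE), nrow, replace = TRUE)
    } else {
      rep(NA, nrow)
    }
  }), stringsAsFactors = FALSE)
}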

Repeatedly apply a conditional summary to groups in a dataframe

I have a large dataframe that looks like this:
group_id distance metric
1 1.1 0.85
1 1.1 0.37
1 1.7 0.93
1 2.3 0.45
...
1 6.3 0.29
1 7.9 0.12
2 2.5 0.78
2 2.8 0.32
...
The dataframe is already sorted by group_id and then by distance. I want to know the efficient dplyr or data.table equivalent of the following operations:
Within each group_id:
Let the unique and sorted values of distance within the current group_id be d_1, d_2, ..., d_n.
For each d in d_1, d_2, ..., d_n: compute some function f on all values of metric whose distance value is less than d. The function f is a custom user-defined function that takes in a vector and returns a scalar. Assume that the function f is well defined on an empty vector.
So, in the example above, the desired dataframe would look like:
group_id distance_less_than metric
1 1.1 f(empty vector)
1 1.7 f(0.85, 0.37)
1 2.3 f(0.85, 0.37, 0.93)
...
1 7.9 f(0.85, 0.37, 0.93, 0.45,...,0.29)
2 2.5 f(empty vector)
2 2.8 f(0.78)
...
Notice how distance values can be repeated, like the value 1.1 under group 1. In such cases, both of the rows should be excluded when the distance is less than 1.1 (in this case this results in an empty vector).
A possible approach is to use the non-equi join available in data.table. The left table is the unique set of combinations of group_id and distance, and the right table contains all the rows whose distance is less than the left table's distance.
f <- sum
DT[unique(DT, by = c("group_id", "distance")),
   on = .(group_id, distance < distance),
   allow.cartesian = TRUE,
   f(metric), by = .EACHI]
output:
group_id distance V1
1: 1 1.1 NA
2: 1 1.7 1.22
3: 1 2.3 2.15
4: 1 6.3 2.60
5: 1 7.9 2.89
6: 2 2.5 NA
7: 2 2.8 0.78
data:
library(data.table)
DT <- fread("group_id distance metric
1 1.1 0.85
1 1.1 0.37
1 1.7 0.93
1 2.3 0.45
1 6.3 0.29
1 7.9 0.12
2 2.5 0.78
2 2.8 0.32")
I don't think this would be faster than the data.table option, but here is one way using dplyr:
library(dplyr)
df %>%
  group_by(group_id) %>%
  mutate(new = purrr::map_dbl(distance, ~ f(metric[distance < .])))
where f is your function. map_dbl expects the function's return type to be double; if your function returns a different type, you might want to use map_int, map_chr, or the like.
If you want to keep only one entry per distance, you can remove the duplicates using filter and duplicated:
df %>%
  group_by(group_id) %>%
  mutate(new = purrr::map_dbl(distance, ~ f(metric[distance < .]))) %>%
  filter(!duplicated(distance))
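For a quick check with f <- sum on the sample data (a sketch; df is just DT from the data.table answer rebuilt as a plain data frame), note that an empty selection gives 0 from sum() here, whereas the data.table output above shows NA:
library(dplyr)
f <- sum   # any function taking a vector and returning a scalar
df <- data.frame(
  group_id = c(1, 1, 1, 1, 1, 1, 2, 2),
  distance = c(1.1, 1.1, 1.7, 2.3, 6.3, 7.9, 2.5, 2.8),
  metric   = c(0.85, 0.37, 0.93, 0.45, 0.29, 0.12, 0.78, 0.32)
)
df %>%
  group_by(group_id) %>%
  mutate(new = purrr::map_dbl(distance, ~ f(metric[distance < .]))) %>%
  filter(!duplicated(distance))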

Using the ddply command on a subset of data

I'm having some issues using the 'ddply' command from the 'plyr' package. I created a dataframe which looks like this one:
u v intensity season
24986 -1.97 -0.35 2.0 1
24987 -1.29 -1.53 2.0 1
24988 -0.94 -0.34 1.0 1
24989 -1.03 2.82 3.0 1
24990 1.37 3.76 4.0 1
24991 1.93 2.30 3.0 2
24992 3.83 -3.21 5.0 2
24993 0.52 -2.95 3.0 2
24994 3.06 -2.57 4.0 2
24995 2.57 -3.06 4.0 2
24996 0.34 -0.94 1.0 2
24997 0.87 4.92 5.0 3
24998 0.69 3.94 4.0 3
24999 4.60 3.86 6.0 3
I tried to use the function cumsum on the u and v values, but I don't get what I want. When I select a subset of my data corresponding to a season, for example:
x <- cumsum(mydata$u[56297:56704]*10.8)
y <- cumsum(mydata$v[56297:56704]*10.8)
...this works perfectly. The thing is that I have a huge dataset (67208 rows) with 92 seasons, and I'd like to make this work on subsets of the data. So I tried this:
new <- ddply(mydata, .(mydata$seasons), summarize, x=c(0,cumsum(mydata$u*10.8)))
...and the result looks like this:
24986 1 NA
24987 1 NA
24988 1 NA
I found some questions related to this one on Stack Overflow and other websites, but none of them helped me deal with my problem. If someone has an idea, you're welcome ;)
Don't use your data.frame's name inside the plyr call; just reference the column name as though it were already defined:
ddply(mydata, .(seasons), summarise, x=c(0, cumsum(u*10.8)))
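A full reproducible version on the sample data could look like the sketch below; note that the column in the data shown above is called season (not seasons), so use whichever name your data frame actually has:
library(plyr)
mydata <- data.frame(
  u = c(-1.97, -1.29, -0.94, -1.03, 1.37, 1.93, 3.83, 0.52, 3.06, 2.57, 0.34, 0.87, 0.69, 4.60),
  v = c(-0.35, -1.53, -0.34, 2.82, 3.76, 2.30, -3.21, -2.95, -2.57, -3.06, -0.94, 4.92, 3.94, 3.86),
  intensity = c(2, 2, 1, 3, 4, 3, 5, 3, 4, 4, 1, 5, 4, 6),
  season = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3)
)
# cumulative displacement within each season
ddply(mydata, .(season), summarise,
      x = c(0, cumsum(u * 10.8)),
      y = c(0, cumsum(v * 10.8)))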

Aggregate a RANGE of values using R language

I have a CSV file with more than 2000 rows and 8 columns. The schema of the CSV is as follows.
col0 col1 col2 col3......
1.77 9.1 9.2 8.8
2.34 6.3 0.9 0.44
5.34 6.3 0.9 0.44
9.34 6.3 0.9 0.44........
.
.
.
2000 rows with data as above
I am trying to aggregate specific sets of rows (set 1: rows 1-76, set 2: rows 96-121, ...) from the above CSV, e.g. between 1.77 and 9.34, across all the columns for the corresponding rows; the aggregate of these rows would become one row in my output file. I have tried various methods, but I could only do it for a single set in the CSV file.
The output would be a CSV file containing the aggregate values of the specified intervals, as follows.
col0 col1 col2 col3
3.25 8.2 4.4 3.3 //(aggregate of rows 1-3)
2.2 3.3 9.9 1.2 //(aggregate of rows 6-10)
and so on..
Considering what Manetheran points out, you should, if not already done, add a column showing which row belongs to which set.
The data.table-way:
require(data.table)
set.seed(123)
dt <- data.table(col1=rnorm(100),col2=rnorm(100),new=rep(c(1,2),each=50))
dt[,lapply(.SD,mean),by="new"]
new col1 col2
1: 1 0.03440355 -0.25390043
2: 2 0.14640827 0.03880684
You can replace mean with any other aggregating function.
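For the original question's fixed row ranges, one way to build that set column is sketched below (the breakpoints 1-76 and 96-121 are the ones mentioned in the question, and "yourfile.csv" is a placeholder for your actual file):
library(data.table)
dt <- fread("yourfile.csv")
# tag each row with the set it belongs to; rows outside any set stay NA
dt[, set := NA_integer_]
dt[1:76, set := 1L]
dt[96:121, set := 2L]
# mean (or sum, etc.) of every column within each set, dropping unassigned rows
dt[!is.na(set), lapply(.SD, mean), by = "set"]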
Here's a possible approach using the base packages:
# Arguments:
# - DF: a data.frame
# - ranges: a list of row ranges, each passed as a
#   vector c(startRowIndex, endRowIndex), used to split
#   the data.frame into sub-data.frames
# - FUN: a function that takes a sub-data.frame and
#   returns the aggregated result
aggregateRanges <- function(DF, ranges, FUN) {
  l <- lapply(ranges, function(x) {
    return(FUN(DF[x[1]:x[2], ]))
  })
  return(do.call(rbind.data.frame, l))
}
# example data
data <- read.table(
header=TRUE,
text=
"col0 col1 col2 col3
1.77 9.1 9.2 8.8
2.34 6.3 0.9 0.44
5.34 6.3 0.9 0.44
9.34 6.3 0.9 0.44
7.32 4.5 0.3 0.42
3.77 2.3 0.8 0.13
2.51 1.4 0.7 0.21
5.44 5.7 0.7 0.18
1.12 6.1 0.6 0.34")
# e.g. aggregate by summing the rows of each sub-data.frame
result <- aggregateRanges(
  data,
  ranges = list(c(1, 3), c(4, 7), c(8, 9)),
  FUN = function(dfSubset) {
    rowsum.data.frame(dfSubset, group = rep.int(1, nrow(dfSubset)))
  }
)
> result
col0 col1 col2 col3
1 9.45 21.7 11.0 9.68
11 22.94 14.5 2.7 1.20
12 6.56 11.8 1.3 0.52
