I have two data frames with the same column and row names but different values.
>data_tls
plot_id max min mean std vol
mf20 20.04 2.23 8.4 3.45 201
mf21 25.24 3.4 4.3 5.5 304
mf22 28.34 5.3 6.2 2.45 240
mf23 30.4 2.05 10.4 6.06 403
>data_uls
plot_id max min mean std vol
mf20 19.09 4.22 6.2 4.45 220
mf21 20.2 2.6 5.3 4.5 305
mf22 32.3 4.3 2.2 3.45 255
mf23 28.4 3.05 8.05 5.85 386
I want to compare the values in these datasets and select those that differ by more than 20%. I am trying to use the compareDF package, following the example here: https://www.r-bloggers.com/comparing-dataframes-in-r-using-comparedf/.
compareData <- compare_df(data_tls, data_uls, c("Plot_name"))
compareData$comparison_df
However, print(compareData$html_output) returns NULL.
I would really appreciate it if someone could help solve this or recommend another solution.
To get a TRUE/FALSE (logical) matrix use
res <- data_tls > data_uls * 1.2 | data_tls < data_uls * 0.8
Note: the data.frames may contain only numeric columns, so you have to remove, e.g., the plot_id column (or select only the numeric columns in the above expression)!
You can then count the flagged values by row or column:
rowSums(res)
colSums(res)
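Re-typing the two tables shown in the question, the whole comparison fits in a short, self-contained sketch (values copied from the question; same 20% threshold as in the expression above):

```r
data_tls <- data.frame(plot_id = c("mf20", "mf21", "mf22", "mf23"),
                       max = c(20.04, 25.24, 28.34, 30.4),
                       min = c(2.23, 3.4, 5.3, 2.05),
                       mean = c(8.4, 4.3, 6.2, 10.4),
                       std = c(3.45, 5.5, 2.45, 6.06),
                       vol = c(201, 304, 240, 403))
data_uls <- data.frame(plot_id = c("mf20", "mf21", "mf22", "mf23"),
                       max = c(19.09, 20.2, 32.3, 28.4),
                       min = c(4.22, 2.6, 4.3, 3.05),
                       mean = c(6.2, 5.3, 2.2, 8.05),
                       std = c(4.45, 4.5, 3.45, 5.85),
                       vol = c(220, 305, 255, 386))

# drop the non-numeric plot_id column before comparing
tls <- data_tls[-1]
uls <- data_uls[-1]

# TRUE where the TLS value is more than 20% above or below the ULS value
res <- tls > uls * 1.2 | tls < uls * 0.8
rowSums(res)  # flagged values per plot
colSums(res)  # flagged values per variable
```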
Related
I have to make a histogram from the given text file hw4aldData containing:
170 172 173 174 174 175 176 177 180 180 180 180 180 181 181 182 182 182 182 184 184 185 186 188
0.84 1.31 1.42 1.03 1.07 1.08 1.04 1.80 1.45 1.60 1.61 2.13 2.15 0.84 1.43 0.90 1.81 1.94 2.68 1.49 2.52 3.00 1.87 3.08
But each data set shows up as a different column in R like:
v1 v2 v3 v4 v5
TankTemp 170 172 173 174
EffRat 0.84 1.31 1.42 1.03
There are many more data points, but I just wanted to show what it looks like. I need to make a histogram for TankTemp and EffRat.
I know how to separate columns to make a histogram:
hist(hw4aldData$v1)
I know how to switch into a transpose matrix:
t(hw4aldData)
but that doesn't work with the row names at the beginning of the columns, and I'm not sure how to make a histogram using all the data points in this form for each of TankTemp and EffRat.
Any help is welcome, thanks.
The first step in asking a question on Stack Overflow is to create a reproducible example. That is a small example that users can input into their computers to test, diagnose, and solve your issue. It not only helps others but it also enables you to properly assess your problem and potentially find a solution while creating the example.
Example
We use the built-in iris data set for values. We only need a few rows and the "Species" label as the first column to look like your example:
df <- iris[c(1,80,150),c(5,1:4)]
df
# Species Sepal.Length Sepal.Width Petal.Length Petal.Width
# 1 setosa 5.1 3.5 1.4 0.2
# 80 versicolor 5.7 2.6 3.5 1.0
# 150 virginica 5.9 3.0 5.1 1.8
That only took one line and is very helpful in visualizing and sharing the problem you are facing.
Reproduce the error
You did not show the error you are receiving but we can show it:
hist(df[1,])
Error in hist.default(df[1, ]) : 'x' must be numeric
hist(t(df[,1]))
We found the problem: the first column contains text, while the others are numeric.
Solution
Let's create row names to call from and delete the first column:
row.names(df) <- df[,1]
df <- df[-1]
df
# Sepal.Length Sepal.Width Petal.Length Petal.Width
# setosa 5.1 3.5 1.4 0.2
# versicolor 5.7 2.6 3.5 1.0
# virginica 5.9 3.0 5.1 1.8
Now we can create the histogram by name. Let's try the "setosa" row:
hist(unlist(df["setosa",]))
Perfect. Cheers.
I have a dataframe in R.
index seq change change1 change2 change3 change4 change5 change6
1 1 0.12 0.34 1.2 1.7 4.5 2.5 3.4
2 2 1.12 2.54 1.1 0.56 0.87 2.6 3.2
3 3 1.86 3.23 1.6 0.23 3.4 0.75 11.2
... ... ... ... ... ... ... ... ...
The name of the data frame is FullData. I can access each column of FullData using the code:
FullData[2] for 'change'
FullData[3] for 'change1'
FullData[4] for 'change3'
...
...
Now, I wish to calculate the standard deviation of the values in the first row of the first four change columns, and so on across all the columns:
standarddeviation = sd ( 0.12 0.34 1.2 1.7 )
then
standarddeviation = sd ( 0.34 1.2 1.7 4.5 )
The above has to be done for all rows, so basically I want to calculate the sd row-wise while the data is stored column-wise. Is it possible to do this?
How can I access a row of the data frame using a for loop on the index or seq variable?
How can I do this in R? Is there a better way?
I guess you're looking for something like this.
st.dev <- numeric()
for (i in 1:dim(FullData)[1]) {
  for (j in 1:dim(FullData)[2]) {
    # sd() needs a numeric vector, so flatten the sub-data.frame first
    st.dev <- cbind(st.dev, sd(unlist(FullData[i:dim(FullData)[1], j:dim(FullData)[2]])))
  }
}
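If the goal is one sd per sliding window of four consecutive change columns within each row, an apply-based sketch may be closer to what the question asks (toy data re-typed from the question; column names assumed as printed):

```r
# toy data mimicking the question's layout
FullData <- data.frame(index = 1:3, seq = 1:3,
                       change  = c(0.12, 1.12, 1.86),
                       change1 = c(0.34, 2.54, 3.23),
                       change2 = c(1.2, 1.1, 1.6),
                       change3 = c(1.7, 0.56, 0.23),
                       change4 = c(4.5, 0.87, 3.4),
                       change5 = c(2.5, 2.6, 0.75),
                       change6 = c(3.4, 3.2, 11.2))

vals <- as.matrix(FullData[, -(1:2)])  # drop index and seq
w <- 4                                 # window width
# for each row: sd over columns 1-4, 2-5, 3-6, 4-7
st.dev <- t(apply(vals, 1, function(r)
  sapply(seq_len(ncol(vals) - w + 1), function(j) sd(r[j:(j + w - 1)]))))
st.dev[1, 1]  # sd of c(0.12, 0.34, 1.2, 1.7), the first window of row 1
```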
I'm working on two datasets derived from cats, a built-in R dataset (from the MASS package).
> cats
Sex Bwt Hwt
1 F 2.0 7.0
2 F 2.0 7.4
3 F 2.0 9.5
4 F 2.1 7.2
5 F 2.1 7.3
6 F 2.1 7.6
7 F 2.1 8.1
8 F 2.1 8.2
9 F 2.1 8.3
10 F 2.1 8.5
11 F 2.1 8.7
12 F 2.1 9.8
...
137 M 3.6 13.3
138 M 3.6 14.8
139 M 3.6 15.0
140 M 3.7 11.0
141 M 3.8 14.8
142 M 3.8 16.8
143 M 3.9 14.4
144 M 3.9 20.5
I want to find the 99% Confidence Interval on the difference of means values between the Bwt of Male and Female specimens (Sex == M and Sex == F respectively)
I know that t.test does this, among other things, but if I break up cats into two datasets that contain the Bwt of males and females, t.test() complains that the two datasets are not of the same length, which is true: there are only 47 females in cats, and 87 males.
Is it doable some other way or am I misinterpreting data by breaking them up?
EDIT:
I have a function suggested to me by an Answerer on another Question that gets the CI of means on a dataset, may come in handy:
ci_func <- function(data, ALPHA){
c(
mean(data) - qnorm(1-ALPHA/2) * sd(data)/sqrt(length(data)),
mean(data) + qnorm(1-ALPHA/2) * sd(data)/sqrt(length(data))
)
}
You should call t.test with the formula interface:
t.test(Bwt ~ Sex, data=cats, conf.level=.99)
As an alternative to t.test, if you are really only interested in the difference of means, you can use:
DescTools::MeanDiffCI(cats$Bwt, cats$Sex)
which gives something like
meandiff lwr.ci upr.ci
-23.71474 -71.30611 23.87662
This is calculated with 999 bootstrapped samples by default. If you want more, you can specify this in the R parameter:
DescTools::MeanDiffCI(cats$Bwt, cats$Sex, R = 1000)
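For reference, the 99% interval can also be pulled straight out of the t.test result object (a sketch; the cats data set comes from the MASS package):

```r
library(MASS)  # provides the cats data set

# formula interface handles the unequal group sizes automatically
tt <- t.test(Bwt ~ Sex, data = cats, conf.level = 0.99)
tt$conf.int  # 99% CI on mean(Bwt | F) - mean(Bwt | M)
```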
I ran into some issues using the ddply function from the plyr package. I created a data frame that looks like this:
u v intensity season
24986 -1.97 -0.35 2.0 1
24987 -1.29 -1.53 2.0 1
24988 -0.94 -0.34 1.0 1
24989 -1.03 2.82 3.0 1
24990 1.37 3.76 4.0 1
24991 1.93 2.30 3.0 2
24992 3.83 -3.21 5.0 2
24993 0.52 -2.95 3.0 2
24994 3.06 -2.57 4.0 2
24995 2.57 -3.06 4.0 2
24996 0.34 -0.94 1.0 2
24997 0.87 4.92 5.0 3
24998 0.69 3.94 4.0 3
24999 4.60 3.86 6.0 3
I tried to use the function cumsum on the u and v values, but I don't get what I want. When I select a subset of my data corresponding to a season, for example:
x <- cumsum(mydata$u[56297:56704]*10.8)
y <- cumsum(mydata$v[56297:56704]*10.8)
...this works perfectly. The thing is that I have a huge dataset (67208 rows) with 92 seasons, and I'd like to make this function work on subsets of the data. So I tried this:
new <- ddply(mydata, .(mydata$seasons), summarize, x=c(0,cumsum(mydata$u*10.8)))
...and the result looks like this :
24986 1 NA
24987 1 NA
24988 1 NA
I found some questions related to this one on Stack Overflow and other websites, but none of them helped me deal with my problem. If someone has an idea, you're welcome ;)
Don't use your data.frame's name inside the plyr call; just reference the column name as though it were already defined (note the column in the data shown is season, not seasons):
ddply(mydata, .(season), summarise, x = c(0, cumsum(u * 10.8)))
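A self-contained sketch on a few rows of the sample data (column name season as printed in the question):

```r
library(plyr)

mydata <- data.frame(u = c(-1.97, -1.29, 1.93, 3.83),
                     v = c(-0.35, -1.53, 2.30, -3.21),
                     season = c(1, 1, 2, 2))

# per season: a leading 0 followed by the running sum of u * 10.8
ddply(mydata, .(season), summarise, x = c(0, cumsum(u * 10.8)))
```

Because summarise returns a length-3 vector per group here, ddply emits three rows per season, recycling the grouping column alongside.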
I have a CSV file with more than 2000 rows and 8 columns. The schema of the CSV is as follows:
col0 col1 col2 col3......
1.77 9.1 9.2 8.8
2.34 6.3 0.9 0.44
5.34 6.3 0.9 0.44
9.34 6.3 0.9 0.44........
...
(2000 rows with data as above)
I am trying to aggregate specific sets of rows (set1: rows 1-76, set2: rows 96-121, ...) from the above CSV, e.g. between 1.77 and 9.34, across all the columns for the corresponding rows; the aggregate of each set of rows would be one row in my output file. I have tried various methods, but I could only do it for a single set in the CSV file.
The output would be a csv file having aggregate values of the specified intervals like follows.
col0 col1 col2 col3
3.25 8.2 4.4 3.3 //(aggregate of rows 1-3)
2.2 3.3 9.9 1.2 //(aggregate of rows 6-10)
and so on..
Considering what Manetheran points out, you should, if not already done, add a column showing which row belongs to which set.
The data.table-way:
require(data.table)
set.seed(123)
dt <- data.table(col1=rnorm(100),col2=rnorm(100),new=rep(c(1,2),each=50))
dt[,lapply(.SD,mean),by="new"]
new col1 col2
1: 1 0.03440355 -0.25390043
2: 2 0.14640827 0.03880684
You can replace mean with any other aggregate function.
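For instance, swapping sum in for mean on the same toy dt (a sketch):

```r
library(data.table)

set.seed(123)
dt <- data.table(col1 = rnorm(100), col2 = rnorm(100), new = rep(c(1, 2), each = 50))

# group sums instead of group means; .SD excludes the "new" grouping column
sums <- dt[, lapply(.SD, sum), by = "new"]
sums
```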
Here's a possible approach using the base packages:
# Arguments:
# - DF: a data.frame
# - ranges: a list of row ranges, passed as a list
#   of vectors c(startRowIndex, endRowIndex),
#   used to split the data.frame into sub-data.frames
# - FUN: a function that takes a sub-data.frame and
#   returns the aggregated result
aggregateRanges <- function(DF, ranges, FUN) {
  l <- lapply(ranges, function(x) {
    FUN(DF[x[1]:x[2], ])
  })
  return(do.call(rbind.data.frame, l))
}
# example data
data <- read.table(
header=TRUE,
text=
"col0 col1 col2 col3
1.77 9.1 9.2 8.8
2.34 6.3 0.9 0.44
5.34 6.3 0.9 0.44
9.34 6.3 0.9 0.44
7.32 4.5 0.3 0.42
3.77 2.3 0.8 0.13
2.51 1.4 0.7 0.21
5.44 5.7 0.7 0.18
1.12 6.1 0.6 0.34")
# e.g. aggregate by summing sub-data.frames rows
result <-
aggregateRanges(
data,
ranges=list(c(1,3),c(4,7),c(8,9)),
FUN=function(dfSubset) {
rowsum.data.frame(dfSubset,group=rep.int(1,nrow(dfSubset)))
}
)
> result
col0 col1 col2 col3
1 9.45 21.7 11.0 9.68
11 22.94 14.5 2.7 1.20
12 6.56 11.8 1.3 0.52