My input consists of x and y, and the output is the corresponding z. Each of my datasets of course consists of multiple (x, y, z) triples. I will define the first dataset as data_1 and the other one as data_2. Now, I would like to compare these two datasets with regard to the difference between their outputs z_1 and z_2.
Question: How could I describe the difference between data_1 and data_2 in %? If a percentage description is not suitable, how could I describe the difference in a global way, so that the description does not account for just one z difference but for all the z values in the datasets?
You can get the average z, e.g. Average(z_1), of each dataset and compare those.
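For example, a minimal sketch in R, assuming the two datasets are data frames named data_1 and data_2 that each carry a z column (hypothetical names):

# hypothetical data frames data_1 and data_2, each with a column z
mean_z1 <- mean(data_1$z)
mean_z2 <- mean(data_2$z)

# percent difference of the average outputs, relative to data_1
pct_diff <- 100 * (mean_z2 - mean_z1) / mean_z1
pct_diff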
To get matched pairs via PSM (the "MatchIt" package with method = "full"), I need to specify my command for my longitudinal data frame. Every case has several observations, but only the first observation per patient should be included in the matching. So the matching should be based on every patient's first observation, while my later analysis should include each patient's complete dataset with all observations.
Does anyone have an idea how to achieve this?
I tried using a data subset (the first observation per patient) but wasn't able to get the matching merged back into the data set with all observations per patient using match.data().
Thanks in advance
Simon (desperately writing his master's thesis)
My understanding is that you want to create matches at just the first time point but have those matches be identified for each unit at all time points. Fortunately, this is pretty straightforward: just perform the matching at the first time point and then merge the matched dataset with the full dataset. Here is how this might look. Let's say your original long dataset is d and has an ID column id and a time column time.
library(MatchIt)
# match on the first time point only
m <- matchit(treat ~ X1 + X2, data = subset(d, time == 1), method = "full")
md1 <- match.data(m)  # matched data: one row per unit, with subclass and weights
# merge the matching variables back into the full longitudinal dataset
d <- merge(d, md1[c("id", "subclass", "weights")], by = "id", all.x = TRUE)
Your new dataset should have two new columns, subclass and weights, which contain the matching subclass and matching weight for each unit. Rows with identical IDs (i.e., rows corresponding to the same unit at multiple time points) will have the same values of subclass and weights.
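As a quick sanity check (a sketch, assuming the columns above), you can confirm that all rows belonging to one patient share a single subclass:

# should be TRUE: each id carries exactly one subclass across its rows
all(tapply(d$subclass, d$id, function(s) length(unique(s)) == 1))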
I am plotting data that consist of some intervals that are more or less constant, plus spikes that originate from the data being a quotient of two parameters. The very large and very small quotients are not relevant for my purpose, so I have been looking for a way to filter them out. The dataset contains 40k+ values, so I cannot manually remove the high/low quotients.
Is there any function that can trim/filter out the very large/small quotients?
You can use the filter() function from dplyr. This creates a new data frame without the outliers, which you can then plot. For example:
library(dplyr)
no_spikes <- filter(original_df, x > -100 & x < 100)
This would create a new dataframe, no_spikes, that only contains observations where the variable x is between the values -100 and 100.
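If sensible fixed cutoffs are not known in advance, a variation on the same idea is to trim by empirical quantiles. This is just a sketch; the 1%/99% cutoffs are an arbitrary assumption:

library(dplyr)
# keep only observations between the 1st and 99th percentiles of x
bounds <- quantile(original_df$x, probs = c(0.01, 0.99), na.rm = TRUE)
no_spikes <- filter(original_df, x >= bounds[1], x <= bounds[2])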
I have a data frame which has 253 rows (locations on a chromosome in Mbps) and 1 column (allele score at each location). I need to produce a data frame which contains the mean of the allele score at every 0.5 Mbps on the chromosome. Please help with R code that can do this. Thanks.
The picture in this case is adequate to construct an answer but not adequate to support testing. You should learn to post data in a form that doesn't require re-entry by hand. (That's why you are accumulating negative votes.)
The basic R strategy would be to use cut to create a grouping variable and then use an apply-style construct to accumulate the groups and apply the mean function. Presumably this is in a data frame, which I will assume is named something specific like my_alleles:
tapply(my_alleles$Allele_score,        # act on this vector
       # in groups defined by this factor
       cut(my_alleles$Location,
           breaks = seq(0, max(my_alleles$Location), by = 0.5)),
       # with this function
       FUN = mean)
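Since a data frame is requested, the named vector that tapply returns can be wrapped up as follows. This is a sketch with made-up example data, since the original data were only posted as a picture:

# made-up stand-in data: 253 locations with random allele scores
my_alleles <- data.frame(Location = seq(0.1, 25.3, by = 0.1),
                         Allele_score = rnorm(253))
bin_means <- tapply(my_alleles$Allele_score,
                    cut(my_alleles$Location,
                        # + 0.5 so the largest location falls inside the last bin
                        breaks = seq(0, max(my_alleles$Location) + 0.5, by = 0.5)),
                    FUN = mean)
result <- data.frame(bin = names(bin_means),
                     mean_score = as.vector(bin_means))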
I have a data set in a wide format consisting of two rows: one with the variable names and one with the corresponding values. The variables represent characteristics of individuals from a sample of size 1000. For instance, I have 1000 variables for the size of each individual, then 1000 variables for the height, then 1000 variables for the weight, etc. Now I would like to run simple regressions (say, weight on calorie consumption). The only way I can think of doing this is to declare a vector that contains the 1000 observations of each variable, for instance:
regressor1=c(mydata$height0, mydata$height1, mydata$height2, mydata$height3, ... mydata$height1000)
But given that I have a few dozen variables and each containing 1000 observations this will become cumbersome. Is there a way to do this with a loop?
I have also thought about the reshape options in R, but this again would put me in a position where I have to type 1000 variable names a few dozen times.
Thank you for your help.
Here is how I would go about your issue. t() will transpose the data for you from many columns to many rows.
Note: t() also works on a matrix rather than just a data frame; I simply coerced to a data frame to show that my example will work with your data.
# many columns, 2 rows
x <- as.data.frame(matrix(1:2000, nrow = 2, ncol = 1000))

# 2 columns, many rows
xt <- t(x)
Based on your comments you are looking to generate vectors.
If you have transposed:
regressor1 <- xt[, 1]
regressor2 <- xt[, 2]
If you have not transposed:
regressor1 <- as.numeric(x[1, ])  # flatten the data-frame row to a vector
regressor2 <- as.numeric(x[2, ])
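Putting this together, here is a sketch of how a regression could then be run. The column names weight and calories are hypothetical stand-ins for whatever the two transposed columns represent:

# after transposing, rows are individuals and columns are variables
xt <- as.data.frame(t(x))
names(xt) <- c("weight", "calories")  # hypothetical variable names
fit <- lm(weight ~ calories, data = xt)
summary(fit)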
I have a data frame (760 rows) with two columns, named Price and Size. I would like to put the data into 4/5 groups based on Price so as to minimize the variance within each group while preserving the order of Size (which is in ascending order). The Jenks natural breaks optimization would be an ideal function; however, it does not take the order of Size into consideration.
Basically, I have data similar to the following (with more data):
Price <- c(90, 100, 125, 100, 130, 182, 125, 250, 300, 95)
Size <- c(10, 10, 10.5, 11, 11, 11, 12, 12, 12, 12.5)
mydata <- data.frame(Size, Price)
I would like to group the data so as to minimize the variance of Price in each group while respecting: 1) the Size value: for example, the first two prices, 90 and 100, cannot be in different groups since they have the same Size; and 2) the order of Size: for example, if group one includes observations 1-2 and group two includes observations 3-9, observation 10 can only enter group two or group three.
Can someone please give me some advice? Maybe there is already some such function that I can’t find?
Is this what you are looking for? With the dplyr package, grouping is quite easy. The %>% can be read as "then do", so you can combine multiple actions if you like.
See http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html for further information.
library("dplyr")
Price <- c(90, 100, 125, 100, 130, 182, 125, 250, 300, 95)
Size <- c(10, 10, 10.5, 11, 11, 11, 12, 12, 12, 12.5)
mydata <- mydata <- data.frame(Size, Price) %>%  # "then"
  group_by(Size)                                 # group data by Size column
mydata_mean_sd <- mydata %>%                     # "then"
  summarise(mean = mean(Price), sd = sd(Price))  # grouped mean and sd for illustration
I had a similar problem with optimally splitting a day into 4 "load blocks". Adjacent time periods must stick together, of course.
It's not an elegant solution, but I wrote my own function that first splits a sorted series at specified break points and then calculates the sum of squared deviations from the class means, sum(SDCM), for those break points (using the algorithm underlying the Jenks approach, as described on Wikipedia).
I then just iterated through all valid combinations of break points and selected the set of points that produced the minimum sum(SDCM).
This would quickly become unmanageable as the number of possible break-point combinations increases, but it worked for my data set.
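A minimal sketch of that brute-force idea, adapted to the question's example data (the number of groups k and all names are assumptions; splits are only allowed where Size changes, which enforces both of the question's constraints):

Price <- c(90, 100, 125, 100, 130, 182, 125, 250, 300, 95)
Size <- c(10, 10, 10.5, 11, 11, 11, 12, 12, 12, 12.5)

sdcm <- function(x) sum((x - mean(x))^2)  # squared deviations from the class mean
cand <- which(diff(Size) != 0)  # valid break points: only where Size changes
k <- 3                          # 3 interior breaks -> 4 groups (an assumption)

best <- NULL
best_val <- Inf
for (br in combn(cand, k, simplify = FALSE)) {
  # assign each observation to a group defined by these break points
  grp <- cut(seq_along(Price), breaks = c(0, br, length(Price)))
  total <- sum(tapply(Price, grp, sdcm))
  if (total < best_val) {
    best_val <- total
    best <- br
  }
}
best  # indices after which each new group starts

With only a handful of candidate break points this enumerates quickly; for many candidates, a dynamic-programming formulation of the same objective would scale much better.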