How to apply the same function to different sets of columns in R?

With the following data set and a time variable time = c(1:10):
mydata
beta_C1 1 beta_C1 2 beta_C1 3 beta_C2 1 beta_C2 2 beta_C2 3
1 5.388135 0.2036038 -0.006050338 5.488691 0.1778483 -0.0036647072
2 5.536004 0.2374793 -0.009960762 5.768781 0.1463565 -0.0012642700
3 5.798095 0.1798015 -0.004768584 6.059320 0.1127296 0.0006366231
4 5.648306 0.2720582 -0.011654632 6.129815 0.1282014 -0.0015109727
5 5.712576 0.2320445 -0.007225099 6.166659 0.1490687 -0.0042889325
6 5.674026 0.2325392 -0.006198976 6.242121 0.1559551 -0.0064668515
I would like to create two matrices such as
new_mat1 <- outer(1:nrow(mydata), 1:length(time), function(x, y) {
  mydata[x, 1] +
  mydata[x, 2] * time[y] +
  mydata[x, 3] * time[y]^2
})
and
new_mat2 <- outer(1:nrow(mydata), 1:length(time), function(x, y) {
  mydata[x, 4] +
  mydata[x, 5] * time[y] +
  mydata[x, 6] * time[y]^2
})
The first matrix is created from the first three columns of mydata, and the second matrix from the last three columns.
Can I apply a function or a for loop to create both matrices together? Any help is appreciated.
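A minimal sketch of one possibility, assuming the coefficient columns always come in consecutive blocks of three (the names col_groups and mats are just illustrative): loop over the two column groups with lapply instead of writing outer() twice.
col_groups <- list(1:3, 4:6)   # columns for beta_C1 and beta_C2
mats <- lapply(col_groups, function(cols) {
  outer(1:nrow(mydata), 1:length(time), function(x, y) {
    mydata[x, cols[1]] +
    mydata[x, cols[2]] * time[y] +
    mydata[x, cols[3]] * time[y]^2
  })
})
new_mat1 <- mats[[1]]
new_mat2 <- mats[[2]]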

Related

How to set up two dynamic conditions in a SUMIFS-like problem in R?

I have already tried my best, but I am still pretty much a newbie to R.
I am working with roughly 500 MB of input data that currently looks like this:
TOTALLISTINGS
listing_id calc.latitude calc.longitude reviews_last30days
1 2818 5829821 335511.0 1
2 20168 5829746 335265.2 3
3 25428 5830640 331534.6 0
4 27886 5832156 332003.1 3
5 28658 5830888 329727.2 3
6 28871 5829980 332071.3 7
I need to calculate the conditional sum of reviews_last30days - the conditions being a specific and changing area range for each respective record, i.e. R should sum only those reviews for which the calc.latitude and calc.longitude do not deviate more than +/-500 from the longitude and latitude values in each row.
EXAMPLE:
ROW 1 has calc.latitude 5829821 and calc.longitude 335511.0, so R should take the sum of all reviews_last30days for which the following ranges apply:
calc.latitude 5829321 to 5830321 (Row 1 latitude +/-500)
calc.longitude 335011.0 to 336011.0 (Row 1 longitude +/-500)
So my intended output would look somewhat like this in column 5:
TOTALLISTINGS
listing_id calc.latitude calc.longitude reviews_last30days reviewsper1000
1 2818 5829821 335511.0 1 4
2 20168 5829746 335265.2 3 4
3 25428 5830640 331534.6 0 10
4 27886 5832156 332003.1 3 3
5 28658 5830888 331727.2 3 10
6 28871 5829980 332071.3 7 10
Hope I calculated correctly in my head, but you get the idea.
So far I have particularly struggled with the fact that my sum conditions are dynamic and "newly assigned", since the latitude and longitude bounds have to be adjusted for each record.
My current code looks like this but it obviously doesn't work that way:
review1000 <- function(TOTALLISTINGS = NULL) {
  # tibble to return
  to_return <- TOTALLISTINGS %>%
    group_by(listing_id) %>%
    summarise(
      reviews1000 = sum(reviews_last30days[(calc.latitude >= (calc.latitude - 500) |
                                            calc.latitude <= (calc.latitude + 500))]))
  return(to_return)
}
REVIEWPERAREA <- review1000(TOTALLISTINGS)
I know I would also have to add something for longitude in the code above.
Does anyone have an idea how to fix this?
Any help or hints highly appreciated & thanks in advance! :)
See whether the code below helps.
TOTALLISTINGS$reviews1000 <- sapply(1:nrow(TOTALLISTINGS), function(r) {
  currentLATI <- TOTALLISTINGS$calc.latitude[r]
  currentLONG <- TOTALLISTINGS$calc.longitude[r]
  # sum the reviews of every listing whose coordinates lie within +/-500 of this row's
  # (between() is dplyr::between())
  sum(TOTALLISTINGS$reviews_last30days[
    between(TOTALLISTINGS$calc.latitude,  currentLATI - 500, currentLATI + 500) &
    between(TOTALLISTINGS$calc.longitude, currentLONG - 500, currentLONG + 500)
  ])
})
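If the same logic is needed again, for example with a different radius, it can be wrapped in a small function. This is only a sketch reusing the code above; reviews_within and its radius argument are hypothetical names.
reviews_within <- function(df, radius = 500) {
  sapply(1:nrow(df), function(r) {
    lat <- df$calc.latitude[r]
    lon <- df$calc.longitude[r]
    # conditional sum for this row's neighbourhood
    sum(df$reviews_last30days[
      dplyr::between(df$calc.latitude,  lat - radius, lat + radius) &
      dplyr::between(df$calc.longitude, lon - radius, lon + radius)
    ])
  })
}
TOTALLISTINGS$reviews1000 <- reviews_within(TOTALLISTINGS)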

Performing a 2-sample t-test in R with replicates

I have a dataframe named R_alltemp in R with 6 columns: 2 groups of data with 3 replicates each. I'm trying to perform a t-test for each row between the first three values and the last three, using apply() so it can go through all the rows in one line. Here is the code I'm using so far.
R_alltemp$p.value<-apply(R_all3,1, function (x) t.test(x(R_alltemp[,1:3]), x(R_alltemp[,4:6]))$p.value)
and here is a snapshot of the table
R1.HCC827 R2.HCC827 R3.HCC827 R1.nci.h1975 R2.nci.h1975 R3.nci.h1975 p.value
1 13.587632 22.225083 15.074230 58.187465 79 82.287573 0.4391160
2 2.717526 1.778007 1.773439 1.763257 2 1.679338 0.4186339
3 203.814478 191.135711 232.320487 253.908939 263 263.656100 0.4904493
4 44.386264 45.339169 54.089884 3.526513 3 5.877684 0.3095634
It runs, but the p-values I'm getting seem wrong just from eyeballing them. For instance, in the first row the average of the first group is far lower than that of the second group, yet my p-value is only 0.4.
I feel like I'm missing something very obvious here, but I've been struggling with it for much longer than I'd like. Any help would be appreciated.
Your code is incorrect. I actually don't understand why it does not return an error. This part in particular: x(R_alltemp[,1:3]) should be x[1:3].
This should be your code:
R_alltemp$p.value2 <- apply(R_alltemp, 1, function(x) t.test(x[1:3], x[4:6])$p.value)
R1.HCC827 R2.HCC827 R3.HCC827 R1.nci.h1975 R2.nci.h1975 R3.nci.h1975 p.value p.value2
1 13.587632 22.225083 15.074230 58.187465 79 82.287573 0.4391160 0.010595829
2 2.717526 1.778007 1.773439 1.763257 2 1.679338 0.4186339 0.477533387
3 203.814478 191.135711 232.320487 253.908939 263 263.656100 0.4904493 0.044883436
4 44.386264 45.339169 54.089884 3.526513 3 5.877684 0.3095634 0.002853154
Remember that by specifying 1 you are telling apply to work over rows, so function(x) receives one row at a time, the equivalent of x <- c(13.587632, 22.225083, 15.074230, 58.187465, 79, 82.287573). That means you subset the first three values with x[1:3] and the last three with x[4:6], then apply t.test to them.
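For instance, a quick way to see what apply() hands to the anonymous function when the margin is 1 (a toy matrix, purely for illustration):
m <- matrix(1:12, nrow = 2, byrow = TRUE)   # a 2 x 6 stand-in for R_alltemp
apply(m, 1, function(x) length(x))          # each x is a whole row, so this returns 6 6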
A good idea before using apply is to test the function manually, so that if you do get odd results like these you know something went wrong with your code.
So the two-tailed p-value for the first row should be:
> g1 <- c(13.587632, 22.225083, 15.074230)
> g2 <- c(58.187465, 79, 82.287573)
> t.test(g1,g2)$p.value
[1] 0.01059583
Applying the function across all rows (I tacked the new p-value onto the end as pval):
> tt$pval <- apply(tt,1,function(x) t.test(x[1:3],x[4:6])$p.value)
> tt
R1.HCC827 R2.HCC827 R3.HCC827 R1.nci.h1975 R2.nci.h1975 R3.nci.h1975 p.value pval
1 13.587632 22.225083 15.074230 58.187465 79 82.287573 0.4391160 0.010595829
2 2.717526 1.778007 1.773439 1.763257 2 1.679338 0.4186339 0.477533387
3 203.814478 191.135711 232.320487 253.908939 263 263.656100 0.4904493 0.044883436
4 44.386264 45.339169 54.089884 3.526513 3 5.877684 0.3095634 0.002853154
Maybe it's the double use of the data frame name inside the function (which you don't need)?

names of a dataset returns NULL - R version 3.2.4 Revised - Ubuntu 14.04 LTS

I have a small issue regarding a dataset I am using. Suppose I have a dataset called mergedData2, defined from a subset of mergedData using these commands:
mergedData=rbind(test_set,training_set)
lookformean<-grep("mean()",names(mergedData),fixed=TRUE)
lookforstd<-grep("std()",names(mergedData),fixed=TRUE)
varsofinterests<-sort(c(lookformean,lookforstd))
mergedData2<-mergedData[,c(1:2,varsofinterests)]
If I do names(mergedData2), I get:
[1] "volunteer_identifier" "type_of_experiment"
[3] "body_acceleration_mean()-X" "body_acceleration_mean()-Y"
[5] "body_acceleration_mean()-Z" "body_acceleration_std()-X"
(I take these first 6 names as an MWE, but the full vector has 68 names)
Now, suppose I want to take the average of each of the measurements per volunteer_identifier and type_of_experiment. For this, I used a combination of split and lapply:
mylist<-split(mergedData2,list(mergedData2$volunteer_identifier,mergedData2$type_of_experiment))
average_activities<-lapply(mylist,function(x) colMeans(x))
average_dataset<-t(as.data.frame(average_activities))
As average_activities is a list, I converted it into a data frame and transposed it to keep the same format as mergedData and mergedData2. The problem is the following: when I call names(average_dataset), it returns NULL! But, more strangely, when I do head(average_dataset), it returns:
volunteer_identifier type_of_experiment body_acceleration_mean()-X body_acceleration_mean()-Y
1 1 0.2773308 -0.01738382
2 1 0.2764266 -0.01859492
3 1 0.2755675 -0.01717678
4 1 0.2785820 -0.01483995
5 1 0.2778423 -0.01728503
6 1 0.2836589 -0.01689542
This is just a small sample of the output, to show that the names of the variables are there. So why does names(average_dataset) return NULL?
Thanks in advance for your reply, best
EDIT: Here is an MWE for mergedData2:
volunteer_identifier type_of_experiment body_acceleration_mean()-X body_acceleration_mean()-Y
1 2 5 0.2571778 -0.02328523
2 2 5 0.2860267 -0.01316336
3 2 5 0.2754848 -0.02605042
4 2 5 0.2702982 -0.03261387
5 2 5 0.2748330 -0.02784779
6 2 5 0.2792199 -0.01862040
body_acceleration_mean()-Z body_acceleration_std()-X body_acceleration_std()-Y body_acceleration_std()-Z
1 -0.01465376 -0.9384040 -0.9200908 -0.6676833
2 -0.11908252 -0.9754147 -0.9674579 -0.9449582
3 -0.11815167 -0.9938190 -0.9699255 -0.9627480
4 -0.11752018 -0.9947428 -0.9732676 -0.9670907
5 -0.12952716 -0.9938525 -0.9674455 -0.9782950
6 -0.11390197 -0.9944552 -0.9704169 -0.9653163
gravity_acceleration_mean()-X gravity_acceleration_mean()-Y gravity_acceleration_mean()-Z
1 0.9364893 -0.2827192 0.1152882
2 0.9274036 -0.2892151 0.1525683
3 0.9299150 -0.2875128 0.1460856
4 0.9288814 -0.2933958 0.1429259
5 0.9265997 -0.3029609 0.1383067
6 0.9256632 -0.3089397 0.1305608
gravity_acceleration_std()-X gravity_acceleration_std()-Y gravity_acceleration_std()-Z
1 -0.9254273 -0.9370141 -0.5642884
2 -0.9890571 -0.9838872 -0.9647811
3 -0.9959365 -0.9882505 -0.9815796
4 -0.9931392 -0.9704192 -0.9915917
5 -0.9955746 -0.9709604 -0.9680853
6 -0.9988423 -0.9907387 -0.9712319
My goal is to get this average_dataset: a dataset containing the average value of each physical quantity (column 3 onwards) for each volunteer and type of experiment, e.g.
1 1 mean1 mean2 mean3 ... mean68
2 1 mean1 mean2 mean3 ... mean68, etc.
After this I will have to export it as a txt file (so I think using write.table with row.names=F and col.names=T). Note that for now, if I do this and re-import the generated file using read.table, I don't recover the column names of the dataset, even when specifying col.names=T.
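A likely explanation, sketched below with base-R behaviour rather than anything specific to this dataset: t(as.data.frame(...)) returns a matrix, not a data frame, and names() on a matrix is NULL; the labels that head() prints live in the matrix's dimnames, so colnames() is the accessor to use. The file name "averages.txt" and the object reimported are only illustrative.
names(average_dataset)      # NULL: average_dataset is a matrix after t()
colnames(average_dataset)   # the variable names that head() displays
# write.table accepts a matrix directly; on re-import, check.names = FALSE
# stops read.table from mangling the "()-X" style column names
write.table(average_dataset, "averages.txt", row.names = FALSE, col.names = TRUE)
reimported <- read.table("averages.txt", header = TRUE, check.names = FALSE)
names(reimported)           # the column names are back, since this is a data frame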

R dynamic data.frame subsetting

I have a dataframe which is similar to this:
A B C D E F G H I J K origin
1 -0.7236848 -0.4245541 0.7083451 3.1623596 3.8169532 -0.04582876 2.0287920 4.409196 -0.3194430 5.9069321 2.7071142 1
2 -0.8317734 4.8795289 0.4585997 -0.2634786 -0.7881651 -0.37251184 1.0951245 4.157672 4.2433676 1.4588268 -0.6623890 1
3 -0.7633280 -0.2985844 -0.9139702 3.7480425 3.5672651 0.06220035 -0.3770195 1.101240 2.0921264 6.6496937 -0.7218320 1
4 0.8311566 -0.7939485 0.5295287 -0.5508124 -0.3175838 -0.63254736 0.6145020 4.186136 -0.1858161 -0.1864584 0.7278854 2
5 1.4768837 -0.7612165 0.8571546 2.3227173 -0.8568081 -0.87088020 0.2269735 4.386104 3.9009236 -0.6429641 3.6163318 2
6 -0.9335004 4.4542639 1.0238832 -0.2304124 0.8630241 -0.50556039 2.8072757 5.168369 5.9942144 0.6165200 -0.5349257 2
Note that the last variable is called origin, a factor with levels 1 and 2; my real data set has more levels.
A function I am using requires this format:
result <- specialFunc(matrix1, matrix2, ....)
What I want to do is write a function such that the input dataframe (or matrix) is split by "origin", so that I dynamically get multiple matrices to pass to my specialFunc.
my solution is:
for (i in 1:length(levels(df[, "origin"]))) {
  assign(paste("Var", "_", i, sep = ''), subset(df, origin == i))
}
Using this, I can create a list of names which I then retrieve with get() to pass to my special function.
As you can imagine, this is not dynamic...
Any suggestions?
I think something like
do.call(specialFunc,
        split.data.frame(df[, -ncol(df)], df$origin))
should do it?
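To sanity-check the split before calling the real function, a stand-in for specialFunc can be used. This is purely hypothetical; it just reports how many rows are handed over for each level of origin.
specialFunc <- function(...) sapply(list(...), nrow)   # dummy: rows per matrix
do.call(specialFunc, split.data.frame(df[, -ncol(df)], df$origin))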

R - subtracting multiple columns from multiple columns with 2 data frames

I have two dataframes as below:
> head(VN.GRACE.Int, 4)
DecimDate CSR GFZ JPL
1 2003.000 12.1465164 5.50259937 15.7402752
2 2003.083 1.8492431 0.27744418 3.4811423
3 2003.167 1.5168512 -0.06333961 1.7962201
4 2003.250 -0.2355813 6.16296554 0.7215013
> head(VN.GLDAS, 4)
Decim_Date NOAH_SManom CLM_SManom VIC_SManom SM_Month_Mean
1 2003.000 3.0596372 0.4023805 -0.2175665 1.081484
2 2003.083 -1.4459928 -1.0255955 -3.1338024 -1.868464
3 2003.167 -3.9945788 -1.4646734 -4.2052981 -3.221517
4 2003.250 -0.9737429 0.4213161 -1.0537822 -0.535403
EDIT: VN.GRACE.Int and VN.GLDAS above are the two data frames in question. I have added an example of what the final data frame will look like.
I want to subtract each of columns [,2:5] in the VN.GLDAS data frame from EACH of columns [,2:4] in VN.GRACE.Int and put the results in a separate data frame (the new data frame will have 12 columns), as below:
EXAMPLE <- data.frame(CSR_NOAH=numeric(), CSR_CLM=numeric(), CSR_VIC=numeric(), CSR_SM_Anom=numeric(),
GFZ_NOAH=numeric(), GFZ_CLM=numeric(), GFZ_VIC=numeric(), GFZ_SM_Anom=numeric(),
JPL_NOAH=numeric(), JPL_CLM=numeric(), JPL_VIC=numeric(), JPL_SM_Anom=numeric())
I've looked into sweep() as suggested in another post, but I am not sure whether my query would be better suited to a for loop, which I'm a novice at. I also looked at subtracting values in one data frame from another, but I don't believe that answers my query. Thanks in advance.
# subtract each GLDAS column from all three GRACE columns, then bind the date column back on
res <- cbind(VN.GRACE.Int[, 1, drop = FALSE],
             do.call(cbind, lapply(VN.GLDAS[, 2:5],
                                   function(x) VN.GRACE.Int[, 2:4] - x)))
dim(res)
#[1] 4 13
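If the CSR_NOAH / GFZ_CLM style column names from the question are wanted, they can be rebuilt from the two sets of input names. This is a sketch that assumes the column order produced by the lapply above; gldas_short is an illustrative name.
gldas_short <- sub("_SManom$", "", names(VN.GLDAS)[2:5])  # "NOAH" "CLM" "VIC" "SM_Month_Mean"
names(res)[-1] <- as.vector(outer(names(VN.GRACE.Int)[2:4], gldas_short, paste, sep = "_"))
head(names(res))  # "DecimDate" "CSR_NOAH" "GFZ_NOAH" "JPL_NOAH" "CSR_CLM" "GFZ_CLM"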
