Subset data based on vector with repeated observations - r

I have the following data with two observations per subject:
SUBJECT <- c(8,8,10,10,11,11,15,15)
POSITION <- c("H","L","H","L","H","L","H","L")
TIME <- c(90,90,30,30,30,30,90,90)
RESPONSE <- c(5.6,5.2,0,0,4.8,4.9,1.2,.9)
DATA <- data.frame(SUBJECT,POSITION,TIME,RESPONSE)
I want the rows of DATA which have SUBJECT numbers that are in a vector, V:
V <- c(8,10,10)
How can I obtain both observations from DATA whose SUBJECT number is in V and have those observations repeated the same number of times as the corresponding SUBJECT number appears in V?
Desired result:
SUBJECT <- c(8,8,10,10,10,10)
POSITION <- c("H","L","H","L","H","L")
TIME <- c(90,90,30,30,30,30)
RESPONSE <- c(5.6,5.2,0,0,0,0)
OUT <- data.frame(SUBJECT,POSITION,TIME,RESPONSE)
I thought some variation of the %in% operator would do the trick but it does not account for repeated subject numbers in V. Even though a subject number is listed twice in V, I only get one copy of the corresponding rows in DATA.
I could also create a loop and append matching observations but this piece is inside a bootstrap sampler and this option would dramatically increase computation time.

merge is your friend:
merge(list(SUBJECT=V), DATA)
# SUBJECT POSITION TIME RESPONSE
#1 8 H 90 5.6
#2 8 L 90 5.2
#3 10 H 30 0.0
#4 10 L 30 0.0
#5 10 H 30 0.0
#6 10 L 30 0.0
As @Frank implies, this logic can be translated to data.table or dplyr or SQL or anything else that will handle a left join.
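If the order of V needs to be preserved (merge sorts by the key), one possible base R alternative is to look up the row indices for each element of V and index DATA once. This is only a sketch: it rebuilds the question's data, and the helper name idx is made up for illustration.

```r
# Data from the question
DATA <- data.frame(
  SUBJECT  = c(8, 8, 10, 10, 11, 11, 15, 15),
  POSITION = c("H", "L", "H", "L", "H", "L", "H", "L"),
  TIME     = c(90, 90, 30, 30, 30, 30, 90, 90),
  RESPONSE = c(5.6, 5.2, 0, 0, 4.8, 4.9, 1.2, 0.9)
)
V <- c(8, 10, 10)

# For each element of V (in order), collect the matching row numbers;
# a subject listed twice in V contributes its rows twice.
idx <- unlist(lapply(V, function(s) which(DATA$SUBJECT == s)))

OUT <- DATA[idx, ]
rownames(OUT) <- NULL  # drop the duplicated row names such as "3.1"
OUT
```

Inside a bootstrap sampler this still loops over V, but only to build an integer index; the data frame itself is subset in a single step.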

Related

R code to iteratively and randomly delete entire rows from a data frame based on a column value, and saving as a new data frame each time

Please forgive me if this question has been asked before!
So I have a dataframe (df) of individuals sampled from various populations with each individual given a population name and a corresponding number assigned to that population as follows:
Individual Population Popnum
ALM16-014 AimeesMdw 1
ALM16-024 AimeesMdw 1
ALM16-026 AimeesMdw 1
ALM16-003 AMKRanch 2
ALM16-022 AMKRanch 2
ALM16-075 BearPawLake 3
ALM16-076 BearPawLake 3
ALM16-089 BearPawLake 3
There are a total of 12 named populations (they do not all have the same number of individuals) with Popnum 1-12 in this file. What I need to do is randomly delete one or more populations (preferably using the 'Popnum' column) from the dataframe, repeat this 100 times, and save each result as a separate dataframe (i.e. df1, df2, df3, etc.). The end result is 100 dfs, each with one randomly chosen population removed. The next step is to repeat this 100 times removing two random populations, then three random populations, and so on.
Any help would be greatly appreciated!!
You can write a function which takes the dataframe and n, the number of Popnum values to remove, as input.
remove_n_Popnum <- function(data, n) {
subset(data, !Popnum %in% sample(unique(Popnum), n))
}
To remove one Popnum you can do:
remove_n_Popnum(df, 1)
# Individual Population Popnum
#1 ALM16-014 AimeesMdw 1
#2 ALM16-024 AimeesMdw 1
#3 ALM16-026 AimeesMdw 1
#4 ALM16-003 AMKRanch 2
#5 ALM16-022 AMKRanch 2
To do this 100 times you can use replicate
list_data <- replicate(100, remove_n_Popnum(df, 1), simplify = FALSE)
To pass different n in remove_n_Popnum function you can use lapply
nested_list_data <- lapply(seq_along(unique(df$Popnum)[-1]),
function(x) replicate(100, remove_n_Popnum(df, x), simplify = FALSE))
where seq_along generates a sequence from 1 up to one less than the number of unique values (here, for the three populations shown in the sample).
seq_along(unique(df$Popnum)[-1])
#[1] 1 2
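Putting the pieces together, a self-contained run on the eight sample rows might look like the sketch below. The seed, the names df1 ... df100 on the list, and the check n_left are additions for illustration; the function body uses plain indexing instead of subset but is meant to behave the same way.

```r
# Sample rows from the question (only 3 of the 12 populations)
df <- data.frame(
  Individual = c("ALM16-014", "ALM16-024", "ALM16-026", "ALM16-003",
                 "ALM16-022", "ALM16-075", "ALM16-076", "ALM16-089"),
  Population = c("AimeesMdw", "AimeesMdw", "AimeesMdw", "AMKRanch",
                 "AMKRanch", "BearPawLake", "BearPawLake", "BearPawLake"),
  Popnum     = c(1, 1, 1, 2, 2, 3, 3, 3)
)

# Drop all rows belonging to n randomly chosen populations
remove_n_Popnum <- function(data, n) {
  dropped <- sample(unique(data$Popnum), n)
  data[!data$Popnum %in% dropped, ]
}

set.seed(42)  # assumption: fixed seed so the run is reproducible

# 100 data frames, each missing one random population
list_data <- replicate(100, remove_n_Popnum(df, 1), simplify = FALSE)
names(list_data) <- paste0("df", seq_along(list_data))

# Outer list: element n holds 100 data frames with n populations removed
n_pops <- length(unique(df$Popnum))
nested_list_data <- lapply(seq_len(n_pops - 1), function(n)
  replicate(100, remove_n_Popnum(df, n), simplify = FALSE))

# Sanity check: every data frame in list_data has 2 populations left
n_left <- sapply(list_data, function(d) length(unique(d$Popnum)))
```

Keeping the results in a named list (list_data$df1, list_data[["df37"]], ...) is usually easier to work with than 100 separate objects in the workspace.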

R function to compare two lists against each other for correction

Let's say I have two vectors for example:
a <- c(1,2,2.1,2.2,3,4,5,6,7,8,9)
b <- c(0.5,1.5,2.5,3.5,4.5,5.5,6.5,7.5,7.6,7.7,8.5)
Filter the two vectors with a function:
a_corrected_by_b <- c(1,2,3,4,5,6,7,8,9)
b_corrected_by_a <- c(0.5,1.5,2.5,3.5,4.5,5.5,6.5,7.5,8.5)
So when you subtract the two lists from each other you get:
c <- (a_corrected_by_b - b_corrected_by_a)
#((1-0.5), (2-1.5), (3-2.5), (4-3.5), (5-4.5), (6-5.5), (7-6.5), (8-7.7), (9-8.5))
I just can't figure out how to filter both of the vectors against each other so you can get two vectors that can subtract from each other properly.
There should be an a vector value in between each b vector value and there should be a b vector value in between each a vector value.
It should also be the closest value to the value in either list as well. So is there any function that can run through both vectors and remove the unnecessary values? So the two vectors can be the same length and have the proper values to subtract from each other.
It seems you want to compare every number from a to every number from b, and keep any member of a which is the smallest member of the set that exceeds one of the members of b. If a single member of a meets this criterion for multiple values of b, you only include the highest value of b among the matches.
It really has to be done this way because if you allow members of a to be matched to members of b that are either lower or higher, then the matching becomes undecidable and ill defined.
You can achieve this using outer and unique:
correct <- function(a, b)
{
d <- outer(a, b, `-`)
d[d < 0] <- NA
data.frame(a_corrected_by_b = a[unlist(unique(apply(d, 2, which.min)))],
b_corrected_by_a = b[unlist(unique(apply(d, 1, which.min)))])
}
So for example:
correct(a, b)
#> a_corrected_by_b b_corrected_by_a
#> 1 1 0.5
#> 2 2 1.5
#> 3 3 2.5
#> 4 4 3.5
#> 5 5 4.5
#> 6 6 5.5
#> 7 7 6.5
#> 8 8 7.7
#> 9 9 8.5
Note there are some sequences that won't "correct" by this method, for example if each member of b is higher than each member of a.
Also, it is not symmetrical, since one of the two sets has to take precedence in the event of ties, so correct(a, b) will be different from correct(b, a), though they will both give valid solutions. This is because the problem as stated does not always have a unique solution.
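As a usage sketch, the corrected columns can then be subtracted directly, which is what the question was after. The function is repeated here so the block runs on its own; the variable names res and diffs are made up for illustration. Note that data.frame will stop with an error if the two corrected vectors end up with different lengths, which can happen for inputs that this method cannot "correct".

```r
a <- c(1, 2, 2.1, 2.2, 3, 4, 5, 6, 7, 8, 9)
b <- c(0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 7.6, 7.7, 8.5)

correct <- function(a, b) {
  d <- outer(a, b, `-`)   # d[i, j] = a[i] - b[j]
  d[d < 0] <- NA          # only keep pairs where a exceeds b
  data.frame(
    a_corrected_by_b = a[unlist(unique(apply(d, 2, which.min)))],
    b_corrected_by_a = b[unlist(unique(apply(d, 1, which.min)))]
  )
}

res <- correct(a, b)

# Element-wise differences between the matched values
diffs <- res$a_corrected_by_b - res$b_corrected_by_a
```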

Searching a named vector in r

In a named vector I want those words starting with "good" along with their frequency. I am getting only the words, not the frequencies.
v <- c(10,20,30,40,50)
names(v) <- c("good afternoon", "hi", "this","good morning","what")
v
# gives error
grep("^good",v,value = TRUE)
# below code works but frequency not showing
grep("^good",names(v),value = TRUE)
I'm not entirely clear what you're asking.
You could stack the vector to give a data.frame with two columns: the values corresponding to your frequency (?) and the expression ind.
stack(v)
# values ind
#1 10 good afternoon
#2 20 hi
#3 30 this
#4 40 good morning
#5 50 what
Then to get the frequency and expression that matches your regexp you could do
stack(v)[grep("^good", stack(v)$ind), ]
# values ind
#1 10 good afternoon
#4 40 good morning
In response to your comment, is this what you're after?
v[grep("^good", names(v))]
#good afternoon good morning
# 10 40
This return object is again a named vector with the vector entries giving the frequencies and the names of the vector corresponding to the expressions.
You want number of hits?
length(grep("^good",names(v)))
# [1] 2
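For a fixed prefix like this, a regular expression is not strictly needed. A small sketch using base R's startsWith (available since R 3.3.0), with hits and hits_df as illustrative names:

```r
v <- c(10, 20, 30, 40, 50)
names(v) <- c("good afternoon", "hi", "this", "good morning", "what")

# Logical index on the names keeps both names and values
hits <- v[startsWith(names(v), "good")]

# Same information as a two-column data frame, if that is easier to use
hits_df <- data.frame(phrase = names(hits), frequency = unname(hits))
```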

R For Loop anomaly when expanding the range

Assume the following dataframe:
Application <- c('A','A','B','B','B','C','C','D')
Rating <- c('0','0.6','0.6','2.0','2.0','3.8','3.8','3.9')
DF <- data.frame(Application,Rating)
DF
#Application Rating
#1 A 0
#2 A 0.6
#3 B 0.6
#4 B 2.0
#5 B 2.0
#6 C 3.8
#7 C 3.8
#8 D 3.9
I want to create an empty results table to be populated through a loop:
1st column - to show the rating being counted (e.g. 0.6)
2nd column - to show the number of times that rating occurs in DF
3rd column - to list total number of ratings in DF (i.e. 8)
4th column - to calculate the proportion of the applications with that rating relative to the overall
#create empty results table
results_rating_bins <- as.data.frame(matrix(nrow = 1, ncol = 4))
#initiate row count
rownr = 1
#Loop:
for (rating in seq(from = 0, to = 4.0, by = 0.1)) {
this_rating <- subset(DF, DF$Rating == rating)
results_rating_bins[rownr, 1] = rating
results_rating_bins[rownr, 2] = nrow(this_rating)
results_rating_bins[rownr, 3] = nrow(DF)
results_rating_bins[rownr, 4] = nrow(this_rating) / nrow(DF)
rownr <- rownr + 1
}
The final result is what I expect, except for rating 2.0 where the count is 0 even though it should be 2.
This illustrates at small scale what I see at larger scale with a 30k line dataset. I have a list of apps with ratings going from 0 to 4.9, so the range in my loop would be set to 0 to 4.9 instead of 0 to 4.0 as in my example. However, when I run the loop on the large dataset I end up with a number of instances where the rating count is 0 even though it shouldn't be. What's even more odd is that, by playing around with the ranges, the ratings where the anomaly (i.e. count = 0) happens vary seemingly at random.
Any idea what may justify this type of behaviour?
Typically I answer the questions as asked, trying to work through the logic a question poster is already using. However, in this case, it is so much easier to use dplyr to aggregate into the new table that I am breaking with tradition.
library(dplyr)
Application <- c('A','A','B','B','B','C','C','D')
Rating <- c('0','0.6','0.6','2.0','2.0','3.8','3.8','3.9')
DF <- data.frame(Application,Rating)
df2<-DF%>%
group_by(Application, Rating)%>%
summarize(ratio=(n()/nrow(DF)))
The first part is the same as yours, but with the library call added.
Where it starts with df2, you are setting the df2 data frame equal to a grouped version of your initial data frame, based on the combinations of Application and Rating. In the summarize statement, for each combination we count the rows with n() and divide by the total number of rows in the original data frame, nrow(DF). This creates the third column of the new data frame: the fraction of the total that each pair represents.
The result looks like this. You could add a column with the total number of rows in an extra step if you need it, but it is not necessary for this calculation.
Application Rating ratio
1 A 0 0.125
2 A 0.6 0.125
3 B 0.6 0.125
4 B 2.0 0.250
5 C 3.8 0.250
6 D 3.9 0.125
This will catch every combination of Application and Rating that occurs in the data and calculate its ratio relative to the whole data frame.
EDIT: If you do not care about the Application letter, you can simply remove it from the group_by function and still get what you want.
To have the total number of rows in the frame on each row, add
%>%
mutate(rows=nrow(DF))
(mutate rather than summarise, because a second summarise would collapse the table and drop the Rating and ratio columns).
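For completeness, here is a base R sketch that builds the four-column table described in the question without a loop. It also hints at a likely cause of the zero count: Rating is stored as text, so comparing it with a number converts the number to text first, and "2.0" is not the same string as "2". With numeric ratings, values produced by seq(..., by = 0.1) can also differ from the literal decimals by tiny floating-point errors, which would explain counts that vanish when the range changes. The column names in results are made up for illustration.

```r
Application <- c("A", "A", "B", "B", "B", "C", "C", "D")
Rating <- c("0", "0.6", "0.6", "2.0", "2.0", "3.8", "3.8", "3.9")
DF <- data.frame(Application, Rating)

# Why the loop misses 2.0: the number 2 is converted to the string "2"
text_vs_number <- "2.0" == 2            # FALSE

# Floating-point pitfall with numeric ratings
seq_vs_literal <- seq(0, 1, by = 0.1)[4] == 0.3   # FALSE

# Count each rating that actually occurs, no loop or == on decimals needed
counts <- table(DF$Rating)
results <- data.frame(
  rating     = names(counts),
  count      = as.vector(counts),
  total      = nrow(DF),
  proportion = as.vector(counts) / nrow(DF)
)
results
```

Ratings that never occur are simply absent from this table rather than listed with a count of 0.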

R: For loop to perform Tidal analysis with Harmonic Constituents

**EDITED, I have made progress, but didn't think my original question was as well constructed as it could be.
I am new to R and computer programming in general and I am attempting to write my first for loop.
I want to be able to do some tidal analysis using harmonic constituents from NOAA.
I have my initial data=data which looks like:
Constituent # Name Amplitude Phase Speed
1 M2 3.264 29.0 28.98
2 S2 0.781 51.9 30.0
3 N2 0.63 12.3 28.43
4 K1 1.263 136.8 15.04
5 M4 0.043 286.0 57.96
The equation for wave height is h(t)= Amplitude*cos(Speed*t-Phase) where t is time.
Therefore I need to perform this calculation for each constituent (row) and sum the results of each constituent by time.
So my middle result will be a table with the ncols=number of time stamps and the nrow= number of constituents.
T1 T2 T3...
data[1,2]*cos(data[1,4]*T1-data[1,3]) data[1,2]*cos(data[1,4]*T2-data[1,3])
data[2,2]*cos(data[2,4]*T1-data[2,3]) data[2,2]*cos(data[2,4]*T2-data[2,3])
.
.
.
data[n,2]*cos(data[n,4]*T1-data[n,3]) data[n,2]*cos(data[n,4]*T2-data[n,3])
With this table I can sum the columns to get my final answer of what the tide height is at each time stamp.
To do this I have attempted to create a for loop.
DF=NULL
for (i in 1:nrow(data)){
DF<- matrix(c(DF, data[i,2]*cos(pi/180*(data[i,4]*Time[,]-data[i,3]))))
}
This returns all the results as a single vector. I can't figure out how to separate it into columns by timestamp. It just runs through all the timestamps for the first constituent, then the second, and so on. So for my current station I have 37 constituents and 100 time stamps, so my matrix DF is 1 column with 3700 rows.
I have tried setting the matrix DF with the appropriate number of columns and rows, but this returns a single result for all rows and columns. I have also tried a nested if statement with time, and many other things that I can't remember.
***Used Rusan's approach and finished what I was doing with the script below. Any other approaches are appreciated.
Time<-matrix(seq(1,100,1)) #my time series
n<-hh3(Time) #Function outlined by Rusan below
b<- matrix(c(rep(Time[1,1]:Time[nrow(Time),1], nrow(wave_table)))) #A repeating list to bind with n
height<-matrix(colSums(dcast(data.frame(cbind(b,n)),Constituent~V1,value.var="V1.1")[,-1])) #The sums of all the constituents at each time stamp, the final height of the wave at each time
This allows me to sum all the constituents at each time stamp. Height=sum of all constituents at time t. So for my example above height(t1)=M2(t1)+S2(t1)+N2(t1)+K1(t1)+M4(t1)
My final output is a matrix of a single vector height. I want this to create an inundation duration curve.
Perhaps this is not an answer - but I would suggest a different approach. I will use the package data.table in R.
library(data.table)
#use own location of your data
wave_table=fread(input="F:\\wave.csv");
wave_table
# Constituent Name Amplitude Phase Speed
# 1: 1 M2 3.264 29.0 28.98
# 2: 2 S2 0.781 51.9 30.00
# 3: 3 N2 0.630 12.3 28.43
# 4: 4 K1 1.263 136.8 15.04
# 5: 5 M4 0.043 286.0 57.96
#create a function which does your calculation on the named columns of your data,
#taking time 't' as a parameter
hh<-function(t){ wave_table[,{Amplitude*cos(Speed*t-Phase)}] }
hh2<-function(t) wave_table[,{Amplitude*cos(Speed*t-Phase)}, by=Name]
hh3<-function(t) wave_table[,{Amplitude*cos(Speed*t-Phase)}, by=Constituent]
hh4<-function(t) wave_table[,{sum(Amplitude*cos(Speed*t-Phase))}, by=Constituent]
#Now the function `hh` can be used like this, giving you a bit
#more flexibility with what you want to do, perhaps?
hh(1)
#3.26334722 -0.77775795 -0.57472163 -0.91362687 -0.01165717
or
hh2(1)
# Name V1
#1: M2 3.26334722
#2: S2 -0.77775795
#3: N2 -0.57472163
#4: K1 -0.91362687
#5: M4 -0.01165717
or
hh4(1) #after adding an extra row to your data: "Constituent=1, Name=M3,
#Amp=1.263, Phase=51.9, Speed=15.04"
# Constituent V1
#1: 1 4.10718774
#2: 2 -0.77775795
#3: 3 -0.57472163
#4: 4 -0.91362687
#5: 5 -0.01165717
In general, explicit loops in R are best avoided for this type of problem: they tend to be slow, and vectorized tools usually do the job better. Loops are typically a last resort.
If the functions hh to hh4 do not do exactly what you want, there are other variations that could be used. Check out http://cran.r-project.org/web/packages/data.table/vignettes/datatable-faq.pdf
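The constituents-by-time table described in the question can also be built in one step with base R's outer, followed by colSums for the total height. This is a sketch under two assumptions: the five constituents are typed in by hand rather than read from a file, and phase and speed are treated as degrees (converted with pi/180, as in the question's loop; the hh functions above pass the values to cos unconverted).

```r
wave_table <- data.frame(
  Name      = c("M2", "S2", "N2", "K1", "M4"),
  Amplitude = c(3.264, 0.781, 0.63, 1.263, 0.043),
  Phase     = c(29.0, 51.9, 12.3, 136.8, 286.0),
  Speed     = c(28.98, 30.0, 28.43, 15.04, 57.96)
)
Time <- 1:100  # time stamps

# angle[i, j] = Speed[i] * Time[j] - Phase[i]
# (the Phase vector is recycled down the rows, one value per constituent)
angle <- outer(wave_table$Speed, Time) - wave_table$Phase

# components[i, j] = Amplitude[i] * cos(angle[i, j]), angle in degrees
components <- wave_table$Amplitude * cos(angle * pi / 180)
dimnames(components) <- list(wave_table$Name, paste0("T", Time))

# Tide height at each time stamp: sum over constituents
height <- colSums(components)
```

Here components has one row per constituent and one column per time stamp, so height is a vector with one value per time stamp.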
