Linear interpolation by multiple groupings in R

I have the following data set:
  District Type DaysBtwn Start_Day End_Day Start_Vol End_Vol
1        A    0        3         0      31        28      23
2        A    1        3         0      31        24       0
3        B    0        3         0      31     17700   10526
4        B    1        3         0      31     44000   35800
5        C    0        3         0      31      5700       0
6        C    1        3         0      31     35000     500
For each combination of District and Type, I want to do a simple linear interpolation: with x = days (Start_Day and End_Day) and y = volumes (Start_Vol and End_Vol), I want the estimated volume returned for xout = DaysBtwn.
I have tried so many things. I think I am having issues because of the way my data is set up. Can someone point me in the right direction on how to use the approx function in R to get the desired output? I don't mind moving my data set around to get the correct format for approx.
Example of desired output:
  District Type EstimatedVol
1        A    0           25
2        A    1           15
3        B    0        13000
4        B    1        39000
5        C    0         2500
6        C    1        25000
dt <- data.table(input)
interpolation <- dt[, approx(x, y, xout = z), by = list(input$District, input$Type)]

Why not simply calculate it directly?
dt$EstimatedVol <- with(dt, (End_Vol - Start_Vol) / (End_Day - Start_Day) * (DaysBtwn - Start_Day) + Start_Vol)
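If you do want approx itself, here is a minimal sketch of the grouped call (my own variant, assuming the data is in a data.table built from input as in the attempt above; approx() returns a list whose y component holds the interpolated value):

library(data.table)
dt <- as.data.table(input)
est <- dt[, .(EstimatedVol = approx(x = c(Start_Day, End_Day),  # the two known days
                                    y = c(Start_Vol, End_Vol),  # the matching volumes
                                    xout = DaysBtwn)$y),        # the day to estimate
          by = .(District, Type)]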

Conditional Statements: selecting/assigning a variable per row

I have a data set with 2 VPs and 350 interval values for each. I am writing a loop with an if condition to flag when a minimum value of VP1 overlaps with the maximum value of VP2.
The data usually sorts by VP, but I arranged it to sort by minimum since it is a timeframe.
I ran the following code, which works to assign 0 or 1 when a value overlaps the previous row, but it does not account for what the previous row is (i.e. whether the previous row is VP1 or VP2).
for (i in 2:length(df$newvariable)) {
  if (df$minimum[i] < df$maximum[i - 1]) {
    df$newvariable[i] <- 0
  } else {
    df$newvariable[i] <- 1
  }
}
I want to say: if df$minimum[i] of VP1 < df$maximum[i] of VP2, then df$newvariable = 0; otherwise, df$newvariable = 1.
I have not been able to figure out how to make the condition apply per row and keep looping. Does anyone have any recommendations?
Many thanks.
Sample Data:
VP xmin xmax
1 0 6
2 0 2
2 6 14
1 14 24
2 20 30
1 30 36
... And so on for 600 or so rows.
Desired Output:
VP xmin xmax newvariable
1 0 6 -
2 0 2 0
2 6 14 1
1 14 24 1
2 20 30 0
1 30 36 1
Suppose the dataframe had another variable and I subsetted to look at only one part of it. For example, the variable is talking and the assignments are 1 (yes) or 0 (no). I originally subsetted to just the 0 rows and created new variables there, like quiet_together. However, I now want to put these dataframes back together, even though I have added different columns to the separate dataframes. If I want exactly the same thing as described above but with the dataframe kept together (instead of as two separate ones), how would I specify it for each assigned value? I want to end up with two new columns based on the xmin and xmax values while accounting for the value in the talking variable: talk_together (for the 1 value of the talking variable) and quiet_together (for the 0 value of the talking variable), set when xmin <= xmax for the previous line.
For example:
Sample Data:
VP xmin xmax talking
1 0 6 0
2 0 2 0
2 2 6 1
2 6 14 0
1 6 14 1
2 14 24 1
1 14 20 0
1 20 30 1
2 24 32 0
1 30 32 0
... And so on for 600 or so rows.
Desired Output:
VP xmin xmax talking talk_together quiet_together
1 0 6 0 0 0
2 0 2 0 0 0
2 2 6 1 0 0
2 6 14 0 0 0
1 6 14 1 0 0
1 14 20 0 0 0
2 14 24 1 1 0
1 20 30 1 1 0
2 24 32 0 0 1
1 30 32 0 0 1
You could use lag from dplyr to compare with the previous xmax value.
library(dplyr)
df %>% mutate(newvariable = as.integer(xmin >= lag(xmax)))
# VP xmin xmax newvariable
#1 1 0 6 NA
#2 2 0 2 0
#3 2 6 14 1
#4 1 14 24 1
#5 2 20 30 0
#6 1 30 36 1
Or use shift with data.table:
library(data.table)
setDT(df)[, newvariable := +(xmin >= shift(xmax))]
Base R alternatives are:
df$newvariable <- as.integer(c(NA, df$xmin[-1] >= df$xmax[-nrow(df)]))
and
df$newvariable <- +c(NA, tail(df$xmin, -1) >= head(df$xmax, -1))
With data.table, we can do
library(data.table)
setDT(df)[, newvariable := as.integer(xmin >= shift(xmax))]
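Neither answer above tackles the talking variant directly. One possible reading, as a sketch (my own assumption: the overlap comparison is done within each talking group, using the column names from the sample data):

library(dplyr)
df %>%
  group_by(talking) %>%                                   # compare rows sharing a talking value
  mutate(together = as.integer(xmin >= lag(xmax))) %>%
  ungroup() %>%
  mutate(together       = coalesce(together, 0L),         # first row of each group has no predecessor
         talk_together  = if_else(talking == 1, together, 0L),
         quiet_together = if_else(talking == 0, together, 0L)) %>%
  select(-together)

The exact overlap rule may need flipping (<= vs >=) depending on which definition of "together" you intend.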

R frequency tables: prop.table does not work if all data points within a variable share the same outcome?

Imagine you have the following data set:
df<-data.frame(read.table(header = TRUE, text = "
ID Wine Beer Water Age Gender
1 0 1 0 20 Male
2 1 0 1 38 Female
3 0 0 1 32 Female
4 1 0 1 30 Male
5 1 1 1 30 Male
6 1 1 1 26 Female
7 0 1 1 36 Female
8 0 1 1 29 Male
9 0 1 1 33 Female
10 0 1 1 20 Female"))
Further, imagine you want to compile summary tables that print out the frequencies of those who drink wine, beer, or water.
I solved it this way:
con<-apply(df[,c(2:4)], 2, table)
con_P<-prop.table(con,2)
It works perfectly. No problem. Now, let us tweak the data set as follows: We set all entries for water to 1.
df<-data.frame(read.table(header = TRUE, text = "
ID Wine Beer Water Age Gender
1 0 1 1 20 Male
2 1 0 1 38 Female
3 0 0 1 32 Female
4 1 0 1 30 Male
5 1 1 1 30 Male
6 1 1 1 26 Female
7 0 1 1 36 Female
8 0 1 1 29 Male
9 0 1 1 33 Female
10 0 1 1 20 Female"))
If I now run the following commands:
con<-apply(df[,c(2:4)], 2, table)
con_P<-prop.table(con,2)
it gives me the following error message after the second line: "Error in margin.table(x, margin) : 'x' is not an array". Why?
Why does it make a difference if all data points within a variable have the same outcome? Also, what can I do to circumvent this problem? Thanks guys!
The function prop.table uses the function sweep, which takes an array as its first argument. Since your second con is a list and not an array, prop.table fails.
Why is your second con a list? Because the table for the column Water has just one element while the tables for all the other columns have two elements. When the number of elements differs, apply can't simplify the result to an array and gives you a list.
In the example you gave us, a safer way is to work with lapply instead; it will always give a list with the results:
con <- lapply(df[, 2:4], table)
con_P <- lapply(con, function(x) x / sum(x))
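If you want the matrix/prop.table route to keep working, one workaround (my own suggestion, not from the answer above) is to force both levels on every column, so each per-column table has the same length and can be simplified to a matrix:

con <- sapply(df[, 2:4], function(x) table(factor(x, levels = c(0, 1))))  # always a 2-row result
con_P <- prop.table(con, 2)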

transform values in data frame, generate new values as 100 minus current value

I'm currently working on a script which will eventually plot the accumulation of losses from cell divisions. First I generate a matrix of values, and then I add the number of times 0 occurs in each column; a 0 represents a loss.
However, I am now thinking that a nicer plot would be a degradation curve. So, given the following example:
> losses_plot_data <- melt(full_losses_data, id = c("Divisions", "Accuracy"),
                           value.name = "Losses", variable.name = "Size")
> full_losses_data
 Divisions Accuracy 20 15 10  5  2
         1        0  0  0  0  3 25
         2        0  0  0  1 10 39
         3        0  0  1  3 17 48
         4        0  0  1  5 23 55
         5        0  1  3  8 29 60
         6        0  1  4 11 34 64
         7        0  2  5 13 38 67
         8        0  3  7 16 42 70
         9        0  4  9 19 45 72
        10        0  5 11 22 48 74
Is there a way I can easily turn this table into 100 minus the numbers shown? If I can plot that data instead of my current data, I would have a lovely curve of degradation from 100% down to however many cells have been lost.
Assuming you do not want to do that for the first column:
fld <- full_losses_data
fld[, 2:ncol(fld)] <- 100 - fld[, -1]
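A tidyverse equivalent, as a sketch (assuming, like the answer above, that only the Divisions column is left untouched):

library(dplyr)
fld <- full_losses_data %>%
  mutate(across(-Divisions, ~ 100 - .x))  # replace every other column by 100 minus its value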

Building a contingency table

I have a data like this:
A B
1 10
1 20
1 30
2 10
2 30
2 40
3 20
3 10
3 30
4 20
4 10
5 10
5 10
and I want to build a contingency table like this:
10 20 30 40
10 1 3 2 0
20 3 0 2 0
30 2 2 0 0
40 0 0 0 0
Meaning: grouping by column A, for every two values of column B that appear together in a group, add 1 to the corresponding cell of the contingency table.
Can you help me do this?
Here is a very ugly answer, using the data from the image, because I already spent too much time on your problem. In general, it's not practical to have your result depend on the order of variables.
A <- rep(c(1:4), c(3, 2, 3, 3))
B <- c(10, 10, 30, 10, 20, 30, 20, 10, 10, 20, 30)
data <- data.frame(cbind(A, B))

# split by A
library(plyr)
data2 <- ddply(data, .(A), function(x) {
  combined_pairs <- cbind(x$B[-nrow(x)], x$B[-1])
  # return data where the first element is always the lowest
  smallest <- apply(combined_pairs, MARGIN = 1, FUN = min)
  largest  <- apply(combined_pairs, MARGIN = 1, FUN = max)
  return(data.frame(small = smallest, large = largest))
})

library(reshape2)
result <- dcast(small ~ large, data = data2, fun.aggregate = length)
> result
small 10 20 30
1 10 1 3 1
2 20 0 0 2
I think you can add the empty rows yourself if you still need them.
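For completeness, a base-R sketch of the same idea that also keeps the empty rows and columns (my own variant, assuming pairs are ordered smallest-first as in the ddply code above):

pairs <- do.call(rbind, lapply(split(data$B, data$A), function(b) {
  if (length(b) < 2) return(NULL)
  p <- cbind(b[-length(b)], b[-1])                    # consecutive pairs within one A group
  cbind(pmin(p[, 1], p[, 2]), pmax(p[, 1], p[, 2]))   # order each pair smallest-first
}))
lv <- sort(unique(data$B))
table(factor(pairs[, 1], levels = lv), factor(pairs[, 2], levels = lv))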

summing a range of columns in data frame

I am having trouble summing select columns within a data frame, a basic problem for which I've seen numerous similar, but not identical, questions/answers on StackOverflow.
With this perhaps overly complex data frame:
site<-c(223,257,223,223,257,298,223,298,298,211)
moisture<-c(7,7,7,7,7,8,7,8,8,5)
shade<-c(83,18,83,83,18,76,83,76,76,51)
sampleID<-c(158,163,222,107,106,166,188,186,262,114)
bluestm<-c(3,4,6,3,0,0,1,1,1,0)
foxtail<-c(0,2,0,4,0,1,1,0,3,0)
crabgr<-c(0,0,2,0,33,0,2,1,2,0)
johnson<-c(0,0,0,7,0,8,1,0,1,0)
sedge1<-c(2,0,3,0,0,9,1,0,4,0)
sedge2<-c(0,0,1,0,1,0,0,1,1,1)
redoak<-c(9,1,0,5,0,4,0,0,5,0)
blkoak<-c(0,22,0,23,0,23,22,17,0,0)
my.data<-data.frame(site,moisture,shade,sampleID,bluestm,foxtail,crabgr,johnson,sedge1,sedge2,redoak,blkoak)
I want to sum the counts of each plant species (bluestem, foxtail, etc. - columns 5-12 in this example) within each site, by summing rows that have the same site number. I also want to keep the information about moisture and shade (these are consistent within a site, but may also be the same between sites), and I want a new column that counts the number of rows summed.
the result would look like this
site,moisture,shade,NumSamples,bluestm,foxtail,crabgr,johnson,sedge1,sedge2,redoak,blkoak
211,5,51,1,0,0,0,0,0,1,0,0
223,7,83,4,13,5,4,8,6,1,14,45
257,7,18,2,4,2,33,0,0,1,1,22
298,8,76,3,2,4,3,9,13,2,9,40
The problem I am having is that my real data sets (and I have several of them) have from 50 to 300 plant species, and I want to refer to a range of columns (in this case, [5:12]) instead of my.data$foxtail, my.data$sedge1, etc., which is going to be very difficult with 300 species.
I know I can start off by deleting the column I don't need (sampleID):
my.data$sampleID <- NULL
but then how do I get the sums? I've messed with the aggregate command and with ddply, and have seen lots of examples that call particular column names, but I just haven't gotten anything to work. I recognize this is a variant of a commonly asked and simple type of question, but I've spent hours without resolving it on my own. So, apologies for my stupidity!
This works ok:
x <- aggregate(my.data[, 5:12],
               by = list(site = my.data$site, moisture = my.data$moisture,
                         shade = my.data$shade),
               FUN = sum, na.rm = TRUE)

library(dplyr)
my.data %>%
  group_by(site) %>%
  tally %>%
  left_join(x)
site n moisture shade bluestm foxtail crabgr johnson sedge1 sedge2 redoak blkoak
1 211 1 5 51 0 0 0 0 0 1 0 0
2 223 4 7 83 13 5 4 8 6 1 14 45
3 257 2 7 18 4 2 33 0 0 1 1 22
4 298 3 8 76 2 4 3 9 13 2 9 40
Or to do it all in dplyr
my.data %>%
  group_by(site) %>%
  tally %>%
  left_join(my.data) %>%
  group_by(site, moisture, shade, n) %>%
  summarise_each(funs(sum = sum)) %>%
  select(-sampleID)
site moisture shade n bluestm foxtail crabgr johnson sedge1 sedge2 redoak blkoak
1 211 5 51 1 0 0 0 0 0 1 0 0
2 223 7 83 4 13 5 4 8 6 1 14 45
3 257 7 18 2 4 2 33 0 0 1 1 22
4 298 8 76 3 2 4 3 9 13 2 9 40
Try following using base R:
# Note: this assumes sampleID has already been dropped
# (my.data$sampleID <- NULL), so the species occupy columns 4:11.
outdf <- data.frame(site = numeric(), moisture = numeric(), shade = numeric(),
                    bluestm = numeric(), foxtail = numeric(), crabgr = numeric(),
                    johnson = numeric(), sedge1 = numeric(), sedge2 = numeric(),
                    redoak = numeric(), blkoak = numeric())

my.data$basic <- with(my.data, paste(site, moisture, shade))

for (b in unique(my.data$basic)) {
  outdf[nrow(outdf) + 1, 1:3] <- unlist(strsplit(b, ' '))
  for (i in 4:11)
    outdf[nrow(outdf), i] <- sum(my.data[my.data$basic == b, i])
}
outdf
site moisture shade bluestm foxtail crabgr johnson sedge1 sedge2 redoak blkoak
1 223 7 83 13 5 4 8 6 1 14 45
2 257 7 18 4 2 33 0 0 1 1 22
3 298 8 76 2 4 3 9 13 2 9 40
4 211 5 51 0 0 0 0 0 1 0 0
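As one more option (my own sketch, not part of the answers above), the formula interface of aggregate avoids naming any species column, assuming sampleID has been dropped first:

my.data$sampleID <- NULL
agg <- aggregate(. ~ site + moisture + shade, data = my.data, FUN = sum)    # "." sums all remaining columns
agg$NumSamples <- as.integer(table(my.data$site)[as.character(agg$site)])  # rows contributing per site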
