I have a problem with the results of an aggregate function in R. My aim is to select certain bird species from a data set and calculate the density
of observed individuals over the surveyed area. To that end, I took a subset of the main data file, then aggregated over area, calculating the
mean area and the number of individuals (represented by the length of the vector). I then wanted to use the calculated mean area and number of individuals to
calculate density. That didn't work. The code I used is given below:
> head(data)
positionmonth positionyear quadrant Species Code sum_areainkm2
1 5 2014 1 Bar-tailed Godwit 5340 155.6562
2 5 2014 1 Bar-tailed Godwit 5340 155.6562
3 5 2014 1 Bar-tailed Godwit 5340 155.6562
4 5 2014 1 Bar-tailed Godwit 5340 155.6562
5 5 2014 1 Gannet 710 155.6562
6 5 2014 1 Bar-tailed Godwit 5340 155.6562
sub.gannet<-subset(data, Species == "Gannet")
sub.gannet<-data.frame(sub.gannet)
x<-sub.gannet
aggr.gannet<-aggregate(sub.gannet$sum_areainkm2,
  by=list(sub.gannet$positionyear, sub.gannet$positionmonth, sub.gannet$quadrant,
          sub.gannet$Species, sub.gannet$Code),
  FUN=function(x) c(observed_area=mean(x), NoInd=length(x)))
names(aggr.gannet)<-c("positionyear", "positionmonth", "quadrant", "species", "code", "x")
aggr.gannet<-data.frame(aggr.gannet)
> aggr.gannet
positionyear positionmonth quadrant species code x.observed_area x.NoInd
1 2014 5 4 Gannet 710 79.8257 10.0000
density <- c(aggr.gannet$x.NoInd/aggr.gannet$x.observed_area)
aggr.gannet <- cbind(aggr.gannet, density)
Error in data.frame(..., check.names = FALSE) :
Arguments imply differing number of rows: 1, 0
> density
numeric(0)
> aggr.gannet$x.observed_area
NULL
> aggr.gannet$x.NoInd
NULL
R doesn't seem to treat the results of the function (observed_area and NoInd) as numeric columns in their own right. That was already apparent when I couldn't give each of them a name, but had to call them "x".
How can I calculate density under these circumstances? Or is there another way to aggregate with multiple functions over the same variable that will result in a usable output?
It's a quirk of aggregate that when the aggregating function returns several values, they are stored together in a single matrix column named after the aggregated variable.
The easiest way to get rid of this is to pass the result through as.list before as.data.frame, which flattens the data structure.
aggr.gannet <- as.data.frame(as.list(aggr.gannet))
It will still use x as the name prefix. The way I discovered to fix this is to use the formula interface to aggregate, so your aggregate would look more like
aggr.gannet<-aggregate(
sum_areainkm2 ~ positionyear + positionmonth +
quadrant + Species + Code,
data=sub.gannet,
FUN=function(x) c(observed_area=mean(x), NoInd=length(x)))
Walking it through (here I haven't taken the subset to illustrate the aggregation by species)
df <- structure(list(positionmonth = c(5L, 5L, 5L, 5L, 5L, 5L), positionyear = c(2014L, 2014L, 2014L, 2014L, 2014L, 2014L), quadrant = c(1L, 1L, 1L, 1L, 1L, 1L), Species = structure(c(1L, 1L, 1L, 1L, 2L, 1L), .Label = c("Bar-tailed Godwit", "Gannet"), class = "factor"), Code = c(5340L, 5340L, 5340L, 5340L, 710L, 5340L), sum_areainkm2 = c(155.6562, 155.6562, 155.6562, 155.6562, 155.6562, 155.6562)), .Names = c("positionmonth", "positionyear", "quadrant", "Species", "Code", "sum_areainkm2"), class = "data.frame", row.names = c(NA, -6L))
df.agg <- as.data.frame(as.list(aggregate(
sum_areainkm2 ~ positionyear + positionmonth +
quadrant + Species + Code,
data=df,
FUN=function(x) c(observed_area=mean(x), NoInd=length(x)))))
Which results in what you want:
> df.agg
positionyear positionmonth quadrant Species Code
1 2014 5 1 Gannet 710
2 2014 5 1 Bar-tailed Godwit 5340
sum_areainkm2.observed_area sum_areainkm2.NoInd
1 155.6562 1
2 155.6562 5
> names(df.agg)
[1] "positionyear" "positionmonth"
[3] "quadrant" "Species"
[5] "Code" "sum_areainkm2.observed_area"
[7] "sum_areainkm2.NoInd"
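With the aggregation flattened into ordinary numeric columns, the density the question asked for is a plain division. A sketch continuing from df.agg (the data frame is rebuilt here so the snippet runs on its own):

```r
# Rebuild the example data (same values as head(data) in the question)
df <- data.frame(
  positionmonth = rep(5L, 6), positionyear = rep(2014L, 6), quadrant = rep(1L, 6),
  Species = c(rep("Bar-tailed Godwit", 4), "Gannet", "Bar-tailed Godwit"),
  Code = c(rep(5340L, 4), 710L, 5340L),
  sum_areainkm2 = rep(155.6562, 6)
)
# Aggregate via the formula interface, then flatten the matrix column
df.agg <- as.data.frame(as.list(aggregate(
  sum_areainkm2 ~ positionyear + positionmonth + quadrant + Species + Code,
  data = df,
  FUN = function(x) c(observed_area = mean(x), NoInd = length(x)))))
# Density = individuals per unit of observed area; both columns are now numeric
df.agg$density <- df.agg$sum_areainkm2.NoInd / df.agg$sum_areainkm2.observed_area
df.agg
```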
Obligatory note: dplyr and data.table are powerful packages that allow doing this sort of aggregation very simply and efficiently.
dplyr
dplyr's syntax (notably the %>% pipe operator) takes some getting used to, but it ends up being quite readable and allows chaining more complex operations:
> require(dplyr)
> df %>%
group_by(positionyear, positionmonth, quadrant, Species, Code) %>%
summarise(observed_area=mean(sum_areainkm2), NoInd = n())
data.table
data.table has a more compact syntax and may be faster with large datasets:
library(data.table)
dt <- as.data.table(df)
dt[,
   .(observed_area=mean(sum_areainkm2), NoInd=.N),
   by=.(positionyear, positionmonth, quadrant, Species, Code)]
Related
I'm new to this community but hope that someone will be able to help me with this issue:
I am trying to find the changes in plane efficiency after the implementation of an intervention (nudge) in 2014. For this, I only need a single breakpoint in 2014, so that I can compare the phases before and after the implementation and then do the same with a control group.
Using fixed.psi = 2014 I was able to fix the first psi; however, R identifies another breakpoint later, in 2016, which should not be included. So I am trying to create a plot that shows the linear regression from 2009 through 2014, followed by an independent linear regression from 2014 through 2019.
Here's my code so far:
# input data:
year period nudge fcost. fconsumption ask distance
1 2009 1 0 NA 396176200 34468768133 NA
2 2010 2 0 NA 403415300 33502639755 NA
3 2011 3 0 NA 381698000 35648670708 NA
4 2012 4 0 NA 409338200 37250324313 NA
5 2013 5 0 NA 398479550 39405973517 NA
6 2014 6 1 NA 406376750 40978703492
# get the data
setwd("/Users/username/R/Thesis")
vaa_data <- read.csv("dataset.csv", na="NA", sep=";")
vaa_data <- vaa_data[-12,]
vaa_data$efficiency <- with(vaa_data, ask/fconsumption)
# create a linear regression from 2009 to 2019 (all data)
model1 <- lm(efficiency ~ year, data=vaa_data)
# create a segmented regression with 2014 as the fixed psi/breakpoint
seg1 <- segmented(obj = model1,
seg.Z = ~ year,
psi = 2014,
fixed.psi = 2014)
# get the fitted data
fitted <- fitted(seg1)
segmodel <- data.frame(Year = vaa_data$year, Efficiency = fitted)
# plot the fitted model
ggplot(segmodel, aes(x = Year, y = Efficiency)) + geom_line()
# output dput(vaa_data):
structure(list(year = 2009:2019, period = 1:11, nudge = c(0L,
0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L), fcost. = c(NA, NA, NA,
1012000000L, 979000000L, 854700000L, 525500000L, 435200000L,
548600000L, 697900000L, 686300000L), fconsumption = c(517301594L,
486740423L, 502363575L, 511000000L, 498000000L, 482299406L, 460164739L,
423060756L, 413466614L, 426236232L, 434442862L), ask = c(4.87e+10,
4.65e+10, 4.92e+10, 5.0466e+10, 5.033e+10, 4.871e+10, 4.8385e+10,
4.7175e+10, 4.6154e+10, 4.7747e+10, 4.8832e+10), distance = c(148440000L,
138140000L, 145130000L, 149230000L, 154480000L, 149990000L, 150230000L,
142910000L, 138790000L, 150840000L, 159260000L), efficiency = c(94.1423737426179,
95.5334667159954, 97.9370369358487, 98.7592954990215, 101.064257028112,
100.995355569648, 105.147126451164, 111.508806550707, 111.626908768987,
112.020040567551, 112.40143243509)), row.names = c(NA, 11L), class = "data.frame")
Any suggestions on what to do? Thanks so much in advance!
This is what I got so far. The first part is perfect, but I need to 'skip' the second breakpoint.
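One workaround, sketched below, avoids segmented() altogether: fit two independent regressions on either side of the 2014 intervention, so no second breakpoint can be estimated. Only year and efficiency from the dput output are used; whether two disconnected fits are acceptable for the analysis is an assumption.

```r
vaa_data <- data.frame(
  year = 2009:2019,
  efficiency = c(94.1423737426179, 95.5334667159954, 97.9370369358487,
                 98.7592954990215, 101.064257028112, 100.995355569648,
                 105.147126451164, 111.508806550707, 111.626908768987,
                 112.020040567551, 112.40143243509)
)
# Independent fits before and after the nudge; 2014 belongs to both phases
pre  <- lm(efficiency ~ year, data = subset(vaa_data, year <= 2014))
post <- lm(efficiency ~ year, data = subset(vaa_data, year >= 2014))
# Stack the fitted segments for plotting
segmodel <- rbind(
  data.frame(Year = 2009:2014, Efficiency = predict(pre,  data.frame(year = 2009:2014)), phase = "pre"),
  data.frame(Year = 2014:2019, Efficiency = predict(post, data.frame(year = 2014:2019)), phase = "post")
)
# ggplot(segmodel, aes(x = Year, y = Efficiency, group = phase)) + geom_line()
```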
I have collected a dataframe that models the duration of events in a group problem-solving session in which the members communicate (Discourse Code) and construct models (Modeling Code). Each minute at which an event occurs is captured in the Time_Processed column. Technically, these events occur simultaneously. I would like to know how long the students construct each type of model, i.e. the total duration of that model, or the time elapsed before that model changes.
I have the following dataset:
`Modeling Code` `Discourse Code` Time_Processed
<fct> <fct> <dbl>
1 OFF OFF 10.0
2 MA Q 11.0
3 MA AG 16.0
4 V S 18.0
5 V Q 20.0
6 MA C 21.0
7 MA C 23.0
8 MA C 25.0
9 V J 26.0
10 P S 28.0
# My explicit dataframe.
df <- structure(list(`Modeling Code` = structure(c(3L, 2L, 2L, 6L,
6L, 2L, 2L, 2L, 6L, 4L), .Label = c("A", "MA", "OFF", "P", "SM",
"V"), class = "factor"), `Discourse Code` = structure(c(7L, 8L,
1L, 9L, 8L, 2L, 2L, 2L, 6L, 9L), .Label = c("AG", "C", "D", "DA",
"G", "J", "OFF", "Q", "S"), class = "factor"), Time_Processed = c(10,
11, 16, 18, 20, 21, 23, 25, 26, 28)), row.names = c(NA, -10L), .Names = c("Modeling Code",
"Discourse Code", "Time_Processed"), class = c("tbl_df", "tbl",
"data.frame"))
For this dataframe I can work out by hand how long the students were constructing each type of model, like this.
With respect to the Modeling Code and Time_Processed columns:
At 10 minutes they are using the OFF model method, then at 11 minutes, they change the model so the duration of the OFF model is (11 - 10) minutes = 1 minute. There are no other occurrences of the "OFF" method so the duration of OFF = 1 min.
Likewise, for Modeling Code method "MA", the model is used from 11 minutes to 16 minutes (duration = 5 minutes) and then from 16 minutes to 18 minutes before the model changes to V with (duration = 2 minutes), then the model is used again at 21 minutes and ends at 26 minutes (duration = 5 minutes). So the total duration of "MA" is (5 + 2 + 5) minutes = 12 minutes.
Likewise the duration of Modeling Code method "V" starts at 18 minutes, ends at 21 minutes (duration = 3 minutes), resumes at 26 minutes, ends at 28 minutes (duration = 2) minutes. So total duration of "V" is 3 + 2 = 5 minutes.
Then the duration of Modeling Code P, starts at 28 minutes and there is no continuity so total duration of P is 0 minutes.
So the total duration (minutes) table of the Modeling Codes is this:
Modeling Code Total_Duration
OFF 1
MA 12
V 5
P 0
These totals correspond to a bar chart of total duration per Modeling Code.
How can the total duration of these modeling methods be computed?
It would also be nice to know the duration of the combinations
such that the only visible combination in this small subset happens to be Modeling Code "MA" paired with Discourse Code "C" and this occurs for 26 - 21 = 5 minutes.
Thank you.
UPDATED SOLUTION
library(dplyr)
library(tidyr)  # replace_na() lives in tidyr

df %>%
  mutate(dur = lead(Time_Processed) - Time_Processed) %>%
  replace_na(list(dur = 0)) %>%
  group_by(`Modeling Code`) %>%
  summarise(tot_time = sum(dur))
(^ Thanks to Nick DiQuattro)
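The same lead-style trick answers the follow-up about Modeling Code/Discourse Code combinations; here is a base-R sketch (no dplyr needed) on the data from the question:

```r
df <- data.frame(
  `Modeling Code`  = c("OFF","MA","MA","V","V","MA","MA","MA","V","P"),
  `Discourse Code` = c("OFF","Q","AG","S","Q","C","C","C","J","S"),
  Time_Processed   = c(10, 11, 16, 18, 20, 21, 23, 25, 26, 28),
  check.names = FALSE
)
# Duration of each row = time until the next event; the final row gets 0
df$dur <- c(diff(df$Time_Processed), 0)
# Sum durations per Modeling Code / Discourse Code combination
combo <- aggregate(dur ~ `Modeling Code` + `Discourse Code`, data = df, FUN = sum)
combo[combo$dur > 0, ]
```

For the MA/C pair this yields 2 + 2 + 1 = 5 minutes, matching the hand calculation above.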
PREVIOUS SOLUTION
Here's one solution that creates a new variable, mcode_grp, which keeps track of discrete groupings of the same Modeling Code. It's not particularly pretty - it requires looping over each row in df - but it works.
First, rename columns for ease of reference:
df <- df %>%
rename(m_code = `Modeling Code`,
d_code = `Discourse Code`)
We'll update df with a few extra variables.
- lead_time_proc gives us the Time_Processed value for the next row in df, which we'll need when computing the total amount of time for each m_code batch
- row_n for keeping track of row number in our iteration
- mcode_grp is the unique label for each m_code batch
df <- df %>%
mutate(lead_time_proc = lead(Time_Processed),
row_n = row_number(),
mcode_grp = "")
Next, we need a way to keep track of when we've hit a new batch of a given m_code value. One way is to keep a counter for each m_code, and increment it whenever a new batch is reached. Then we can label all the rows for that m_code batch as belonging to the same time window.
mcode_ct <- df %>%
group_by(m_code) %>%
summarise(ct = 0) %>%
mutate(m_code = as.character(m_code))
This is the ugliest part. We loop over every row in df, and check to see if we've reached a new m_code. If so, we update accordingly, and register a value for mcode_grp for each row.
mc <- ""
for (i in 1:nrow(df)) {
current_mc <- df$m_code[i]
if (current_mc != mc) {
mc <- current_mc
mcode_ct <- mcode_ct %>% mutate(ct = ifelse(m_code == mc, ct + 1, ct))
current_grp <- mcode_ct %>% filter(m_code == mc) %>% select(ct) %>% pull()
}
df <- df %>% mutate(mcode_grp = ifelse(row_n == i, current_grp, mcode_grp))
}
Finally, group_by m_code and mcode_grp, compute the duration for each batch, and then sum over m_code values.
df %>%
group_by(m_code, mcode_grp) %>%
summarise(start_time = min(Time_Processed),
end_time = max(lead_time_proc)) %>%
mutate(total_time = end_time - start_time) %>%
group_by(m_code) %>%
summarise(total_time = sum(total_time)) %>%
replace_na(list(total_time=0))
Output:
# A tibble: 4 x 2
m_code total_time
<fct> <dbl>
1 MA 12.
2 OFF 1.
3 P 0.
4 V 5.
For any dplyr/tidyverse experts out there, I'd love tips on how to accomplish more of this without resorting to loops and counters!
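In answer to that closing question, one loop-free way to build mcode_grp (a sketch): run-length encoding assigns an id to each consecutive batch of the same m_code.

```r
m_code <- c("OFF", "MA", "MA", "V", "V", "MA", "MA", "MA", "V", "P")
r <- rle(m_code)                                   # runs: OFF, MA, V, MA, V, P
mcode_grp <- rep(seq_along(r$lengths), r$lengths)  # one id per batch, repeated per row
mcode_grp
```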
I have data like this:
df <-structure(list(label = structure(c(5L, 6L, 7L, 8L, 3L, 1L, 2L,
9L, 10L, 4L), .Label = c(" holand", " holandindia", " Holandnorway",
" USAargentinabrazil", "Afghanestan ", "Afghanestankabol", "Afghanestankabolindia",
"indiaAfghanestan ", "USA", "USAargentina "), class = "factor"),
value = structure(c(5L, 4L, 1L, 9L, 7L, 10L, 6L, 3L, 2L,
8L), .Label = c("1941029507", "2367321518", "2849255881",
"2913128511", "2927576083", "4550996370", "457707181.9",
"637943892.6", "796495286.2", "89291651.19"), class = "factor")), .Names = c("label",
"value"), class = "data.frame", row.names = c(NA, -10L))
I want to take the longest name (in letters), find which shorter, similar names match it, and assign them all to one group; then move on to the next longest name and assign its matches to another group, and so on until no names are left.
First I calculate the length of each string:
library(dplyr)
dft <- data.frame(names=df$label,chr=apply(df,2,nchar)[,1])
colnames(dft)[1] <- "label"
df2 <- inner_join(df, dft)
Now I can simply find which string is the longest
df2[which.max(df2$chr),]
Now I should see which other strings share letters with this long string. We have these possibilities:
Afghanestankabolindia
it can be
A
Af
Afg
Afgh
Afgha
Afghan
Afghane
.
.
.
all possible prefixes, but the order of the letters must stay the same (from left to right): for example Afgha is valid, but fAhg is not.
so we have only two other strings that are similar to this one
Afghanestan
Afghanestankabol
This is because, to be assigned to the same group, the strings must match exactly; not even a single letter may differ (beyond the extra letters of the longest string).
The desired output is as follows:
label value group
Afghanestan 2927576083 1
Afghanestankabol 2913128511 1
Afghanestankabolindia 1941029507 1
indiaAfghanestan 796495286.2 2
Holandnorway 457707181.9 3
holand 89291651.19 3
holandindia 4550996370 3
USA 2849255881 4
USAargentina 2367321518 4
USAargentinabrazil 637943892.6 4
Why is indiaAfghanestan a separate group? Because it does not belong completely to another name (it partially matches one name or another); to share a group, it would have to be contained in a bigger name.
I tried to use this one Find similar strings and reconcile them within one dataframe which did not help me at all
I found something else which maybe helps
require("Biostrings")
pairwiseAlignment(df2$label[3], df2$label[1], gapOpening=0, gapExtension=4,type="overlap")
but still I don't know how to assign them into one group
You could try
library(magrittr)
df$label %>%
tolower %>%
trimws %>%
stringdist::stringdistmatrix(method = "jw", p = 0.1) %>%
as.dist %>%
`attr<-`("Labels", df$label) %>%
hclust %T>%
plot %T>%
rect.hclust(h = 0.3) %>%
cutree(h = 0.3) %>%
print -> df$group
df
# label value group
# 1 Afghanestan 2927576083 1
# 2 Afghanestankabol 2913128511 1
# 3 Afghanestankabolindia 1941029507 1
# 4 indiaAfghanestan 796495286.2 2
# 5 Holandnorway 457707181.9 3
# 6 holand 89291651.19 3
# 7 holandindia 4550996370 3
# 8 USA 2849255881 4
# 9 USAargentina 2367321518 4
# 10 USAargentinabrazil 637943892.6 4
See ?stringdist::'stringdist-metrics' for an overview of the string dissimilarity measures offered by stringdist.
As part of a project, I am currently using R to analyze some data. I am stuck retrieving a few values from a dataset that I imported from a csv file.
The file's structure is shown in the dput output below.
For my analysis, I want to create another column containing the difference between the current value of x and its previous value; however, the first row of every unique i should keep x's current value. I am new to R and have been trying various approaches for some time, but I am still unable to figure out a way to do this. Any suggestions on an approach I could follow would be appreciated.
Mydata structure
structure(list(t = 1:10, x = c(34450L, 34469L, 34470L, 34483L,
34488L, 34512L, 34530L, 34553L, 34575L, 34589L), y = c(268880.73342868,
268902.322359863, 268938.194698248, 268553.521856105, 269175.38273083,
268901.619719038, 268920.864512966, 269636.604121984, 270191.206593437,
269295.344751692), i = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L)), .Names = c("t", "x", "y", "i"), row.names = c(NA, 10L), class = "data.frame")
You can use the package data.table to obtain what you want:
library(data.table)
setDT(MyData)[, x_diff := c(x[1], diff(x)), by=i]
MyData
# t x i x_diff
# 1: 1 34287 1 34287
# 2: 2 34789 1 502
# 3: 3 34409 1 -380
# 4: 4 34883 1 474
# 5: 5 34941 1 58
# 6: 6 34045 2 34045
# 7: 7 34528 2 483
# 8: 8 34893 2 365
# 9: 9 34551 2 -342
# 10: 10 34457 2 -94
Data:
set.seed(123)
MyData <- data.frame(t=1:10, x=sample(34000:35000, 10, replace=T), i=rep(1:2, e=5))
You can use the diff() function. If you want to add a new column to your existing data frame, note that diff() returns a vector one element shorter than the data frame, so pad it:
# if your data frame is called MyData
MyData$newX = c(NA,diff(MyData$x))
That inputs an NA as the first entry of the new column; the remaining values are the differences between sequential values in your "x" column.
UPDATE:
You can create a simple loop that subsets the data for every unique value of "i" and then calculates the differences of the x values:
# initialize a new dataframe
newdf = NULL
values = unique(MyData$i)
for(i in 1:length(values)){
  data1 = MyData[MyData$i == values[i],]
  data1$newX = c(NA,diff(data1$x))
  newdf = rbind(newdf,data1)
}
# and then if you want to overwrite your original dataframe with newdf
MyData = newdf
# remove some variables
rm(data1,newdf,values)
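A loop-free base-R alternative (a sketch on the same simulated MyData): ave() applies a function within each group of i, and here the first value of each group is kept as-is, as the question requested.

```r
set.seed(123)
MyData <- data.frame(t = 1:10, x = sample(34000:35000, 10, replace = TRUE), i = rep(1:2, each = 5))
# Within each i: first value unchanged, then successive differences
MyData$newX <- ave(MyData$x, MyData$i, FUN = function(v) c(v[1], diff(v)))
MyData
```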
I have a very large csv file. I want to calculate the frequency of the items in the second column in order to graph a histogram. An example of my data:
0010,10.1.1.16
0011,10.2.2.10
0012,192.168.2.61
0013,192.168.173.19
0014,10.2.2.10
0015,10.2.2.10
0016,192.168.2.61
I have used the below:
inFile <- read.csv("file.csv")
summary(inFile)
hist(inFile$secondCol)
output of summary:
X0010 X10.1.1.16
Min. :11.00 10.2.2.10 :3
1st Qu.:12.25 192.168.173.19:1
Median :13.50 192.168.2.61 :2
Mean :13.50
3rd Qu.:14.75
Max. :16.00
Because the file is very large, I'm not getting the right histogram. Any suggestions?
Just use table.
DF <- structure(list(V1 = 10:16, V2 = structure(c(1L, 2L, 4L, 3L, 2L,
2L, 4L), .Label = c("10.1.1.16", "10.2.2.10",
"192.168.173.19", "192.168.2.61"), class = "factor")),
.Names = c("V1", "V2"), class = "data.frame",
row.names = c(NA, -7L))
table(DF$V2)
# 10.1.1.16 10.2.2.10 192.168.173.19 192.168.2.61
# 1 3 1 2
If you want a data.frame out of this, you can use as.data.frame:
as.data.frame(table(DF$V2))
# Var1 Freq
# 1 10.1.1.16 1
# 2 10.2.2.10 3
# 3 192.168.173.19 1
# 4 192.168.2.61 2
Since you say you want a histogram, the counts can be plotted directly with ggplot2, without computing them first. Because V2 is discrete, use geom_bar (geom_histogram requires a continuous x):
require(ggplot2)
ggplot(data = DF, aes(x = V2)) + geom_bar()
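If ggplot2 is not available, base R's barplot() on the table gives the same picture of the counts:

```r
DF <- data.frame(V2 = c("10.1.1.16", "10.2.2.10", "192.168.2.61",
                        "192.168.173.19", "10.2.2.10", "10.2.2.10",
                        "192.168.2.61"))
counts <- table(DF$V2)
barplot(counts, las = 2)  # las = 2 turns the long IP labels perpendicular
```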
We could also have used as.numeric() on the column:
typeof(data$hourofcrime)
#> [1] "list"
hour_crime_rate <- as.numeric(data$hourofcrime)
hist(hour_crime_rate)