I am attempting to calculate the covariance (or correlation) between the average stem counts of two species. The average stem counts are in the "avg" column, the species are listed together in the "Spnum" column, and they are assigned IDs of 2 and 18. I would like to split these calculations out by Year, Season, and Treatment.
I believe I am getting close using ddply, but I am stuck figuring out how to tell ddply that the values are in a different column ("avg") from the species identifiers.
  row.names Year Spnum  avg Season Treatment
1         1 2005     2 21.8  early     delay
2         7 2005    18 18.5  early     delay
3        31 2005     2 24.5  early     delay
4        37 2005    18 13.2  early     delay
5        60 2005     2 20.7  early      ambi
6        66 2005    18 31.0  early      ambi
7        89 2005     2 36.5  early      ambi
...
Here are two options using dplyr and data.table. We group by the 'Year', 'Season', and 'Treatment' variables and then compute the correlation between the 'avg' values corresponding to a 'Spnum' of 2 (avg[Spnum==2]) and those corresponding to a 'Spnum' of 18 (avg[Spnum==18]).
library(dplyr)
df1 %>%
  group_by(Year, Season, Treatment) %>%
  summarise(Cor = cor(avg[Spnum == 2], avg[Spnum == 18]))
Or, using data.table, we convert the 'data.frame' to a 'data.table' (setDT(df1)) and, grouped by the same variables, compute the correlation.
library(data.table)
setDT(df1)[, .(Cor = cor(avg[Spnum == 2], avg[Spnum == 18])), by = .(Year, Season, Treatment)]
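For a quick check, here is a minimal sketch (not from the original question) that builds a small, balanced df1 from the first four sample rows and runs the dplyr version above. Note that cor() requires vectors of equal length, so every group must contain the same number of Spnum 2 and Spnum 18 records.

library(dplyr)

# hypothetical reconstruction of a single Year/Season/Treatment group
df1 <- data.frame(
  Year = 2005,
  Spnum = c(2, 18, 2, 18),
  avg = c(21.8, 18.5, 24.5, 13.2),
  Season = "early",
  Treatment = "delay"
)

df1 %>%
  group_by(Year, Season, Treatment) %>%
  summarise(Cor = cor(avg[Spnum == 2], avg[Spnum == 18]))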
I have the following panel data in R:
ID_column<- c("A","A","A","A","B","B","B","B")
Date_column<-c(20040131, 20041231,20051231,20061231, 20051231, 20061231, 20071231, 20081231)
Price_column<-c(12,13,17,19,35,38,39,41)
Data<- data.frame(ID_column, Date_column, Price_column)
#The data looks like this:
ID_column Date_column Price_column
1: A 20040131 12
2: A 20041231 13
3: A 20051231 17
4: A 20061231 19
5: B 20051231 35
6: B 20061231 38
7: B 20071231 39
8: B 20081231 41
My next aim would be to convert the Date column, which is currently in numeric YYYYMMDD format, into YYYY by simply taking the first four digits of each entry, as follows:
Data$Date_column<- substr(Data$Date_column,1,4)
#The data then looks like:
ID_column Date_column Price_column
1 A 2004 12
2 A 2004 13
3 A 2005 17
4 A 2006 19
5 B 2005 35
6 B 2006 38
7 B 2007 39
8 B 2008 41
My ultimate goal is to employ the plm package for panel data regression, but when I use pdata.frame to set the ID and Time variables as indices, I get error messages about duplicate ID/Time pairs (in this case rows 1 and 2, which would both be given the tag A, 2004). To solve this issue, I would like to delete row 1 in the original data and keep only the newer observation from the year 2004. This would then provide me with unique ID/Time pairs across the whole data.
Therefore I was hoping someone could help me with a loop or a package suggestion with which I can keep only the row with the newer/later observation within a year, whenever this occurs, in a way that also works for larger data sets. I believe this involves a couple of conditional steps which I am having difficulty putting together. A loop that checks whether the first four digits of consecutive date observations are identical and then deletes the one with the "smaller" date (keeps the "larger" date) would probably do it, but my experience with loops is very limited.
Kind regards and thank you!
I'd recommend keeping Date_column as a reference for picking the later observation and mutating a new column containing only the year, since you want the latest observation in each year.
library(dplyr)

Data$year <- substr(Data$Date_column, 1, 4)
Data$Date_column <- lubridate::ymd(Data$Date_column)

Data %>%
  arrange(desc(Date_column)) %>%
  distinct(ID_column, year, .keep_all = TRUE) %>%
  arrange(Date_column)
ID_column Date_column Price_column year
1 A 2004-12-31 13 2004
2 A 2005-12-31 17 2005
3 B 2005-12-31 35 2005
4 A 2006-12-31 19 2006
5 B 2006-12-31 38 2006
6 B 2007-12-31 39 2007
Since we arranged by the actual date in descending order, the rows dropped for each unique combination of ID and year are guaranteed to be the older ones. You can reverse the arrangement to keep the oldest occurrence instead, as sketched below.
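A minimal sketch of that reversed case (same Data and dplyr as above): arranging in ascending date order first means distinct() keeps the earliest observation within each ID/year pair.

library(dplyr)

# keep the oldest observation per ID and year instead of the newest
Data %>%
  arrange(Date_column) %>%
  distinct(ID_column, year, .keep_all = TRUE)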
I am trying to apply a function to a dataframe to add a column which calculates the percentile rank for each record based on Weather Station ID (WSID) and Season Grouping.
## temperatures data frame:
WSID Season Date Temperature
20 Summer 24/01/2020 18
12 Summer 25/01/2020 20
20 Summer 26/01/2020 25
12 Summer 27/01/2020 17
20 Winter 18/10/2020 15
12 Winter 19/10/2020 12
12 Winter 20/10/2020 13
12 Winter 21/10/2020 14
## Code tried:
library(dplyr)
library(purrr)

perc.rank <- function(x) trunc(rank(x)) / length(x)

rank.perc <- function(mdf) {
  mdf %>%
    mutate(percentile = perc.rank(Temperature))
}

temperatures <- temperatures %>%
  split(.$WSID) %>%
  map_dfr(~ rank.perc(.))
## Expected Output :
WSID Season Date Temperature Percentile
20 Summer 24/01/2020 18 0.333
12 Summer 25/01/2020 20 0.444
20 Summer 26/01/2020 25 0.666
12 Summer 27/01/2020 17 0.333
20 Winter 18/10/2020 15
12 Winter 19/10/2020 12
12 Winter 20/10/2020 13
12 Winter 21/10/2020 14
Is there some elegant way to do this using functions such as group_modify, group_split, map and/or split?
I was thinking there should be, since, for example, there could be a grouping factor with three or more levels.
The code works when I split the data by WSID, but I can't seem to get any further when I want to group by WSID + Season as well.
(Filled in Percentile values were calculated from Excel percentile rank function)
You can apply the function directly with group_by instead of splitting; the rank.perc wrapper also seems unnecessary.
library(dplyr)
perc.rank <- function(x) trunc(rank(x))/length(x)
df %>%
group_by(WSID) %>%
mutate(percentile = perc.rank(Temperature))
With group_by it is also easy to add more grouping variables later, e.g. group_by(WSID, Season), as sketched below.
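For instance, a short sketch of the two-level grouping (using the same perc.rank function; df is the temperatures data frame, as in the code above). Percentile ranks are then computed within each WSID/Season combination.

library(dplyr)

df %>%
  group_by(WSID, Season) %>%
  mutate(percentile = perc.rank(Temperature)) %>%
  ungroup()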
In R, I need to calculate several time interval variables between resightings of marked individuals. I have a dataset similar to this:
ID Time Day Month
a 11.15 13 6
a 12.35 13 6
a 10.02 14 6
a 19.30 15 6
a 20.46 15 6
.
.
.
b 11.12 8 7
etc
In which each ID represents a different animal marked for individual recognition, and each row contains the date and time in which it was relocated.
For each individual, I'd need to calculate the number of days each animal was observed, the mean and standard deviation of the number of relocations per day, and the mean and standard deviation of the days elapsed between relocations (counting 0 days between observations made on the same day).
Ideally, I need to obtain a data frame such as this:
ID N.Obs N.days mean.Obs.per.Day m.O.D.sd mean.days.elapsed mde.sd
a 27 7 4.2 1.1 1.5 0.5
b 32 5 3.4 0.4 3.2 0.7
c 17 6 4.4 0.2 4.5 1.2
d etc
I've been doing it using the tapply function and transferring the results to Excel, but I am sure there must be relatively simple code that would let me do the whole process in R.
The OP has requested to aggregate 6 statistics per ID. Four of them can be directly aggregated by grouping by ID. Two (mean.Obs.per.Day and m.O.D.sd) need to be grouped by date and ID first.
Unfortunately, the time stamps are split up in three different fields, Time, Day, and Month with the year missing. As four of the statistics are based on dates, we need to construct a Date column which combines Day, Month, and a dummy year.
The code below utilises the data.table and lubridate packages for efficiency.
library(data.table)
# coerce to data.table and add Date column
setDT(DF)[, Date := lubridate::make_date(1970L, Month, Day)]
# aggregate by ID,
# use temporary variable to hold the day differences between resightings
agg_per_id <- DF[, {
tmp <- as.numeric(diff(Date))
.(N.Obs = .N, N.days = uniqueN(Date),
mean.days.elapsed = mean(tmp),
mde.sd = sd(tmp))
} , by = ID]
# aggregate by Date and ID
agg_per_day_and_id <- DF[, .N, by = .(ID, Date)][
, .(mean.Obs.per.Day = mean(N), m.O.D.sd = sd(N)), by = ID]
# join partial results
result <- agg_per_day_and_id[agg_per_id, on = "ID"]
# reorder columns (for comparison with expected result)
setcolorder(result, c("ID", "N.Obs", "N.days", "mean.Obs.per.Day",
"m.O.D.sd", "mean.days.elapsed", "mde.sd"))
result
ID N.Obs N.days mean.Obs.per.Day m.O.D.sd mean.days.elapsed mde.sd
1: a 5 3 1.666667 0.5773503 0.5 0.5773503
2: b 1 1 1.000000 NA NaN NA
Note that the figures differ from the expected result of the OP due to different input data.
Data
As far as provided by the OP
DF <- readr::read_table(
"ID Time Day Month
a 11.15 13 6
a 12.35 13 6
a 10.02 14 6
a 19.30 15 6
a 20.46 15 6
b 11.12 8 7"
)
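For comparison, here is a rough dplyr sketch of the same two-step aggregation (not from the original answer), using the DF defined above: per-ID statistics first, per-day observation counts second, then a join.

library(dplyr)
library(lubridate)

# add a Date column with a dummy year, as in the data.table version
DF2 <- DF %>% mutate(Date = make_date(1970L, Month, Day))

# statistics that aggregate directly by ID
per_id <- DF2 %>%
  group_by(ID) %>%
  summarise(N.Obs = n(),
            N.days = n_distinct(Date),
            mean.days.elapsed = mean(as.numeric(diff(Date))),
            mde.sd = sd(as.numeric(diff(Date))))

# statistics that need the per-day observation counts first
per_day <- DF2 %>%
  count(ID, Date) %>%
  group_by(ID) %>%
  summarise(mean.Obs.per.Day = mean(n), m.O.D.sd = sd(n))

left_join(per_id, per_day, by = "ID")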
I have binned data reflecting the width of rivers across each continent. Below is a sample dataset. I pretty much just want to get the data into the form I have shown.
dat <- read.table(text =
"width continent bin
5.32 Africa 10
6.38 Africa 10
10.80 Asia 20
9.45 Africa 10
22.66 Africa 30
9.45 Asia 10",header = TRUE)
How do I melt the above toy dataset to create this dataframe?
Bin Count Continent
10 3 Africa
10 1 Asia
20 1 Asia
30 1 Africa
We could use any one of several approaches to aggregate by group.
The data.table option would be to convert the 'data.frame' to a 'data.table' (setDT(dat)); grouped by the 'continent' and 'bin' variables, we get the number of elements per group (.N).
library(data.table)
setDT(dat)[, .(Count = .N), by = .(continent, bin)]
# continent bin Count
#1: Africa 10 3
#2: Asia 20 1
#3: Africa 30 1
#4: Asia 10 1
Or a similar option with dplyr by grouping the variables and then use n() instead of .N to get the count.
library(dplyr)
dat %>%
group_by(continent, bin) %>%
summarise(Count=n())
Or we can use aggregate from base R; using the formula method, we get the length.
aggregate(cbind(Count=width)~., dat, FUN=length)
# continent bin Count
#1 Africa 10 3
#2 Asia 10 1
#3 Asia 20 1
#4 Africa 30 1
From #Frank's and #David Arenburg's comments, some additional options using data.table and dplyr. We convert the dataset to a data.table (setDT(dat)), convert to 'wide' format with dcast, then convert it back to 'long' with melt, and subset the rows (value > 0).
library(data.table)
melt(dcast(setDT(dat),continent~bin))[value>0]
Using count from dplyr
library(dplyr)
count(dat, bin, continent)
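In recent dplyr versions, count() also takes a name argument, so the count column can be named to match the requested "Count" header directly (a small sketch using the same dat):

count(dat, bin, continent, name = "Count")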
With sqldf:
library(sqldf)
sqldf("SELECT bin, continent, COUNT(continent) AS count
FROM dat
GROUP BY bin, continent")
Output:
bin continent count
1 10 Africa 3
2 10 Asia 1
3 20 Asia 1
4 30 Africa 1
I am puzzled by something that I thought would easily work.
I have a dataframe with year, city, and species columns.
species City Year
80 Landpattedyr Sisimiut 2007
83 Landpattedyr Sisimiut 2008
87 Landpattedyr Sisimiut 2009
721733 Havpattedyr Upernavik 2010
721734 Havpattedyr Upernavik 2011
721735 Havpattedyr Upernavik 2007
I have used the function unique as follows
years<-unique(df$Year)
city<-unique(df$City)
species<-unique(df$species)
now I need to assign a value in each of those vectors to a dataframe row based on an index, for example
hunting[1,]$year<-years[i]
hunting[1,]$group<-species[j]
hunting[1,]$city<-city[k]
The problem is that only year is copied properly while city and species in the hunting df show up as numbers. I can't figure out why this is happening. Can anybody help please?
year group city lat long total
1 2007 6 19 66.93 -53.66 4563
NA 2007 6 20 72.78 -56.15 91
3 2007 6 8 67.01 -50.72 388
4 2007 6 21 70.66 -52.12 280
5 2007 6 14 77.47 -69.23 469
6 2007 6 5 69.22 -51.10 1114
To find out whether a column is a factor or a character vector, you can use is.factor(df$City) or is.character(df$City).
In the case of a factor column, the (unique) levels are stored in the levels attribute, which can be accessed with
levels(df$City)
Note: this may include levels that are not present in the vector, for instance, if some rows have been removed or if some levels have been added.
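A small illustration of this with hypothetical values: after subsetting, the factor still remembers the removed level.

x <- factor(c("Sisimiut", "Upernavik", "Nuuk"))
x <- x[x != "Nuuk"]   # drop the elements equal to "Nuuk"
levels(x)             # still includes "Nuuk" even though no element has it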
To retrieve the unique elements of a factor or character vector, you can use this:
as.character(unique(df$City))
This will not return levels that are absent from the data in factor columns.
Note: the last command is slightly more efficient than unique(as.character(df$City)), since the conversion is evaluated on a possibly shorter vector.
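A quick sketch of that comparison (both return the same city names, but the first converts only the already-deduplicated vector):

as.character(unique(df$City))   # unique first, then convert
unique(as.character(df$City))   # convert the full column first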