R Table data with a grouping command - r

This seems like a very simple problem, but I can't sort it out. I searched this forum and found the topics listed below, which come close but don't do exactly what I need. I have count data over several years and want to obtain the frequency of each count value by year. It seems I need a table function with a grouping option, but I haven't found the proper syntax.
Data:
count year
1 15 1957
2 6 1957
3 23 1957
4 23 1957
5 2 1957
6 28 1980
7 15 1980
8 32 1980
9 18 1981
thank you in advance!
Counting the number of elements with the values of x in a vector
grouping data splitted by frequencies
Aggregate data in R

You're looking for the table function. Using the lowercase column names from your data, something like:
with(yourdata, table(year, count))
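A minimal sketch with the sample data from the question, assuming the data frame is called `yourdata` and has lowercase columns `count` and `year` as printed there. The names `freq` and `freq_long` are only illustrative:

```r
# Sample data copied from the question
yourdata <- data.frame(
  count = c(15, 6, 23, 23, 2, 28, 15, 32, 18),
  year  = c(1957, 1957, 1957, 1957, 1957, 1980, 1980, 1980, 1981)
)

# Cross-tabulate: one row per year, one column per distinct count value
freq <- with(yourdata, table(year, count))
print(freq)

# Long format, keeping only the year/count combinations that actually occur
freq_long <- subset(as.data.frame(freq), Freq > 0)
print(freq_long)
```

The long format is often easier to read when most cells of the cross-tabulation are zero.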

Related

Selecting later date observation in panel data in R

I have the following panel data in R:
ID_column<- c("A","A","A","A","B","B","B","B")
Date_column<-c(20040131, 20041231,20051231,20061231, 20051231, 20061231, 20071231, 20081231)
Price_column<-c(12,13,17,19,35,38,39,41)
Data<- data.frame(ID_column, Date_column, Price_column)
#The data looks like this:
ID_column Date_column Price_column
1: A 20040131 12
2: A 20041231 13
3: A 20051231 17
4: A 20061231 19
5: B 20051231 35
6: B 20061231 38
7: B 20071231 39
8: B 20081231 41
My next aim is to convert the Date column, which is currently in a numeric YYYYMMDD format, into YYYY by simply taking the first four digits of each entry in the date column as follows:
Data$Date_column<- substr(Data$Date_column,1,4)
#The data then looks like:
ID_column Date_column Price_column
1 A 2004 12
2 A 2004 13
3 A 2005 17
4 A 2006 19
5 B 2005 35
6 B 2006 38
7 B 2007 39
8 B 2008 41
My ultimate goal is to use the plm package for panel data regression, but when I use pdata.frame to set the ID and time variables as indices, I get error messages about duplicate ID/Time pairs (in this case rows 1 and 2, which would both be tagged A, 2004). To solve this, I would like to delete row 1 in the original data and keep only the newer observation from 2004. This would then give me unique ID/Time pairs across the whole data set.
I was therefore hoping someone could suggest a loop or a package with which I can keep only the row with the later observation within a year whenever this occurs, in a way that also scales to larger data sets. I believe this involves a few conditional commands that I am currently having difficulty putting together. A loop that checks whether the first four digits of consecutive date observations are identical, and then deletes the row with the "smaller" date and keeps the one with the "larger" date, would probably do it, but my experience with loops is very limited.
Kind regards and thank you!
I'd recommend keeping Date_column as the reference for picking the later observation and adding a new column that holds only the year, since you want the latest observation in each year.
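library(dplyr)
# Extract the year before converting the numeric YYYYMMDD value to a Date
Data$year <- substr(Data$Date_column, 1, 4)
Data$Date_column <- lubridate::ymd(Data$Date_column)
# Latest date first, keep the first row per ID/year, then restore date order
Data %>% arrange(desc(Date_column)) %>%
  distinct(ID_column, year, .keep_all = TRUE) %>%
  arrange(Date_column)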
ID_column Date_column Price_column year
1 A 2004-12-31 13 2004
2 A 2005-12-31 17 2005
3 B 2005-12-31 35 2005
4 A 2006-12-31 19 2006
5 B 2006-12-31 38 2006
6 B 2007-12-31 39 2007
7 B 2008-12-31 41 2008
Because the rows are first arranged by the actual date in descending order, distinct() keeps the latest row for each unique combination of ID and year, so the rows it drops are the older ones. To keep the oldest occurrence instead, arrange in ascending order before calling distinct().
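The same idea can be sketched in base R without dplyr or lubridate. This is an alternative for illustration, not the approach used in the answer: sort so the latest date in each ID/year comes first, then drop every row whose ID/year pair has already been seen. The names `sorted` and `latest` are hypothetical:

```r
# Panel data copied from the question
Data <- data.frame(
  ID_column    = c("A", "A", "A", "A", "B", "B", "B", "B"),
  Date_column  = c(20040131, 20041231, 20051231, 20061231,
                   20051231, 20061231, 20071231, 20081231),
  Price_column = c(12, 13, 17, 19, 35, 38, 39, 41)
)

# Year is the first four digits of the numeric YYYYMMDD value
Data$year <- substr(as.character(Data$Date_column), 1, 4)

# Sort so the latest date within each ID comes first
sorted <- Data[order(Data$ID_column, -Data$Date_column), ]

# duplicated() flags the second and later rows of each ID/year pair,
# which after sorting are the earlier dates
latest <- sorted[!duplicated(sorted[c("ID_column", "year")]), ]

# Restore chronological order within each ID
latest <- latest[order(latest$ID_column, latest$Date_column), ]
print(latest)
```

The result has one row per ID/year pair, which is the uniqueness that pdata.frame complained about in the question.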

R: How to plot multiple series when the series is included as a variable?

I want to plot multiple lines to one graph of five different time series. The problem is that my data frame is arranged like so:
Series Time Price ...
1 Dec 2003 5
2 Dec 2003 10
3 Dec 2003 2
1 Jan 2004 10
2 Jan 2004 10
3 Jan 2004 5
This is a simplified version, and there are many other variables for each observation. I'd like to be able to plot time vs price and use the first variable as the indicator for which series.
The time period is 77 months long, so I'm not sure if there's an easy way to reshape the data to look like:
Series Dec.2003.Price Jan.2004.Price ...
1 5 10
2 10 10
3 2 5
or a way to graph these like I said without reshaping.
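You can try xyplot from the lattice package, which takes the long format directly and needs no reshaping:
library(lattice)
xyplot(Price ~ Time, groups=Series, data=df, type="l")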
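If the wide layout described in the question is still wanted, base R's reshape() can produce it. The sketch below assumes the data frame holds only the three columns shown (with more variables per observation, the extra columns would need to be dropped or listed in `v.names`); the name `wide` is illustrative:

```r
# Simplified data copied from the question
df <- data.frame(
  Series = c(1, 2, 3, 1, 2, 3),
  Time   = c("Dec 2003", "Dec 2003", "Dec 2003",
             "Jan 2004", "Jan 2004", "Jan 2004"),
  Price  = c(5, 10, 2, 10, 10, 5)
)

# One row per Series, one Price column per distinct Time value
wide <- reshape(df, idvar = "Series", timevar = "Time", direction = "wide")
print(wide)
```

The new columns are named by pasting "Price." in front of each Time value.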

Flatten rows in R data frame by column match

I have a dataset that looks something like this.
year recipient amount id
1 1973 AG 17 7
2 1973 AG 18 7
3 1974 BE 20 9
4 1974 BE 22 9
5 1975 AG 20 7
6 1975 AG 25 7
I'm trying to flatten the rows so that there is only a single row for each recipient per year. I'd like to transform the amount variable to be equal to the sum of all amounts over that year. My ideal result looks like this:
year recipient amount id
1 1973 AG 35 7
2 1974 BE 42 9
3 1975 AG 45 7
I tried writing a loop to accomplish this, but I think that there has to be an easier way that I'm just not familiar with. Maybe a map or flatten function somewhere in a package?
Try:
library(dplyr)
df %>% group_by(year, recipient, id) %>% summarise(amount=sum(amount))
Source: local data frame [3 x 4]
Groups: year, recipient
year recipient id amount
1 1973 AG 7 35
2 1974 BE 9 42
3 1975 AG 7 45
It is probably more power than you need for this simple example, but for this sort of thing I love the sqldf library, which lets you manipulate data frames as if they were database tables using SQL. In your case:
library(sqldf)
newdf <- sqldf("SELECT year,recipient,id,sum(amount) as amount from olddf group by year,recipient,id")
By default it uses SQLite as a backend, so it can handle fairly complex SQL statements. I usually find R's data manipulation language a little confusing and ALWAYS have to look up what I'm trying to do, so using SQL instead can be very convenient.
Here is an option using data.table
library(data.table)
setDT(df1)[, list(amount=sum(amount), id= id[1L]) ,.(year, recipient)]
# year recipient amount id
#1: 1973 AG 35 7
#2: 1974 BE 42 9
#3: 1975 AG 45 7
Or, if "id" should also be a grouping variable:
setDT(df1)[, list(amount=sum(amount)), .(year, recipient, id)]
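For completeness, the same grouping can be sketched in base R with aggregate(), which needs no extra packages. This is an additional option, not one given in the answers above; `flat` is an illustrative name:

```r
# Data copied from the question
df <- data.frame(
  year      = c(1973, 1973, 1974, 1974, 1975, 1975),
  recipient = c("AG", "AG", "BE", "BE", "AG", "AG"),
  amount    = c(17, 18, 20, 22, 20, 25),
  id        = c(7, 7, 9, 9, 7, 7)
)

# Sum amount within each year/recipient/id combination
flat <- aggregate(amount ~ year + recipient + id, data = df, FUN = sum)

# aggregate() orders rows by the grouping variables, so sort by year explicitly
flat <- flat[order(flat$year), ]
print(flat)
```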

count unique values in one column for specific values in another column

I have a data frame on bills that has (among other variables) a column for 'year', a column for 'issue', and a column for 'sub issue.' A simplified example df looks like this:
year issue sub issue
1970 4 20
1970 3 21
1970 4 22
1970 2 8
1971 5 31
1971 4 22
1971 9 10
1971 3 21
1971 4 22
Etc., for about 60 years. I want to count the unique values in the issue and sub issue columns for each year, and use those to create a new df- dat2. Using the df above, dat2 would look like this:
year issues sub issues
1970 3 4
1971 4 4
Wary of factors, I confirmed that the values in all columns are integers, if that makes a difference. I am new to R (obviously), and I haven't been able to find relevant code for this specific purpose online. Thanks for any help!!
That's a one-liner, with aggregate:
with(d,aggregate(cbind(issue,subissue) ~ year,FUN=function(x){length(unique(x))}))
returning:
year issue subissue
1 1970 3 4
2 1971 4 4
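A self-contained version of that call, assuming the data frame is named `d` and the third column is named `subissue` without a space, as in the answer. The helper `n_unique` and the renamed output columns are illustrative:

```r
# Example data copied from the question
d <- data.frame(
  year     = c(1970, 1970, 1970, 1970, 1971, 1971, 1971, 1971, 1971),
  issue    = c(4, 3, 4, 2, 5, 4, 9, 3, 4),
  subissue = c(20, 21, 22, 8, 31, 22, 10, 21, 22)
)

# Number of distinct values in a vector
n_unique <- function(x) length(unique(x))

# Apply n_unique to both columns within each year
dat2 <- aggregate(cbind(issue, subissue) ~ year, data = d, FUN = n_unique)
names(dat2) <- c("year", "issues", "subissues")
print(dat2)
```

Passing `data = d` directly is equivalent to wrapping the call in with(d, ...).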

copy result of unique() string vector in a dataframe R

I am puzzled by something that I thought would easily work.
I have a dataframe with year, city, and species columns.
species City Year
80 Landpattedyr Sisimiut 2007
83 Landpattedyr Sisimiut 2008
87 Landpattedyr Sisimiut 2009
721733 Havpattedyr Upernavik 2010
721734 Havpattedyr Upernavik 2011
721735 Havpattedyr Upernavik 2007
I have used the function unique as follows
years<-unique(df$Year)
city<-unique(df$City)
species<-unique(df$species)
Now I need to assign a value from each of those vectors to a dataframe row based on an index, for example
hunting[1,]$year<-years[i]
hunting[1,]$group<-species[j]
hunting[1,]$city<-city[k]
The problem is that only year is copied properly while city and species in the hunting df show up as numbers. I can't figure out why this is happening. Can anybody help please?
year group city lat long total
1 2007 6 19 66.93 -53.66 4563
NA 2007 6 20 72.78 -56.15 91
3 2007 6 8 67.01 -50.72 388
4 2007 6 21 70.66 -52.12 280
5 2007 6 14 77.47 -69.23 469
6 2007 6 5 69.22 -51.10 1114
To find out whether a column is a factor or a character vector, you can use is.factor(df$City) or is.character(df$City).
In the case of a factor column, the (unique) levels are stored in the levels attribute, which can be accessed with
levels(df$City)
Note: this may include levels that are not present in the vector, for instance, if some rows have been removed or if some levels have been added.
To retrieve the unique elements of a factor or character vector, you can use this:
as.character(unique(df$City))
Unlike levels(), this will not return levels that are not present in a factor column.
Note: the last command is slightly more efficient than unique(as.character(df$City)), since the conversion is evaluated on a possibly shorter vector.
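A small sketch of why the numbers appear, assuming the City column was read in as a factor (for example with `stringsAsFactors = TRUE`). A factor stores integer codes plus labels, and the codes are what show up when the value is coerced to a number. The data here is a made-up three-row subset:

```r
# Hypothetical subset with City stored as a factor
df <- data.frame(
  City = c("Sisimiut", "Sisimiut", "Upernavik"),
  stringsAsFactors = TRUE
)

city <- unique(df$City)

is.factor(city)              # the unique values are still a factor
codes  <- as.integer(city)   # underlying integer codes
labels <- as.character(city) # the city names themselves

print(codes)
print(labels)
```

Assigning `as.character(city)[k]` rather than `city[k]` keeps the city name instead of its code.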
