copy result of unique() string vector in a dataframe R - r

I am puzzled by something that I thought would easily work.
I have a dataframe with year, city, and species columns.
species City Year
80 Landpattedyr Sisimiut 2007
83 Landpattedyr Sisimiut 2008
87 Landpattedyr Sisimiut 2009
721733 Havpattedyr Upernavik 2010
721734 Havpattedyr Upernavik 2011
721735 Havpattedyr Upernavik 2007
I have used the function unique as follows
years<-unique(df$year)
city<-unique(df$City)
species<-unique(df$species)
now I need to assign a value in each of those vectors to a dataframe row based on an index, for example
hunting[1,]$year<-year[i]
hunting[1,]$group<-species[j]
hunting[1,]$city<-city[k]
The problem is that only year is copied properly while city and species in the hunting df show up as numbers. I can't figure out why this is happening. Can anybody help please?
year group city lat long total
1 2007 6 19 66.93 -53.66 4563
NA 2007 6 20 72.78 -56.15 91
3 2007 6 8 67.01 -50.72 388
4 2007 6 21 70.66 -52.12 280
5 2007 6 14 77.47 -69.23 469
6 2007 6 5 69.22 -51.10 1114

To find out if a column is factor or character you can use this is.factor(df$City) or is.character(df$City).
In the case of a factor column, the (unique) levels are stored in the levels attribute, which can be accessed with
levels(df$City)
Note: this may include levels that are not present in the vector, for instance, if some rows have been removed or if some levels have been added.
To retrieve the unique elements of a factoror character vector, you can use this:
as.character(unique(df$City))
Which will not return levels that are not present in factor columns.
Note: the last command is slightly more efficient than unique(as.character(df$City)), since the conversion is evaluated on a possibly shorter vector.

Related

Selecting later date observation in panel data in R

I have the following panel data in R:
ID_column<- c("A","A","A","A","B","B","B","B")
Date_column<-c(20040131, 20041231,20051231,20061231, 20051231, 20061231, 20071231, 20081231)
Price_column<-c(12,13,17,19,35,38,39,41)
Data<- data.frame(ID_column, Date_column, Price_column)
#The data looks like this:
ID_column Date_column Price_column
1: A 20040131 12
2: A 20041231 13
3: A 20051231 17
4: A 20061231 19
5: B 20051231 35
6: B 20061231 38
7: B 20071231 39
8: B 20081231 41
My next aim would be to convert the Date column which is currently in a numeric YYYYMMDD format into YYYY by simply taking the first four digits of each entry in the data column as follows:
Data$Date_column<- substr(Data$Date_column,1,4)
#The data then looks like:
ID_column Date_column Price_column
1 A 2004 12
2 A 2004 13
3 A 2005 17
4 A 2006 19
5 B 2005 35
6 B 2006 38
7 B 2007 39
8 B 2008 41
My ultimate goal would be to employ the plm package for panel data regression, but when applying the package and using pdata.frame to set the ID and Time variables as indices, I get error messages of duplicate ID/Time pairs (In this case rows 1 and 2 which would both be given the tag: A,2004). To solve this issue, I would like to delete row 1 in the original data, and only keep the newer observation from the year 2004. This would the provide me with unique ID/Time pairs across the whole data.
Therefore I was hoping for someone to help me out with a loop or a package suggestion with which I can only keep the row with the newer/later observation within a year, if this occurs, also for application to larger data sets.. I believe this involves a couple commands of conditional formatting which I am having difficulties putting together currently. I believe a loop that evaluates whether the first four digits of consecutive date observations are identical and then deletes the one with the "smaller" date/takes the "larger" date would do it, but my experience with loops is very limited.
Kind regards and thank you!
I'd recommend to keep the Date_column as a reference to pick the later observation and mutate a new column for only the year,since you want the latest observation each year.
Data$year<- substr(Data$Date_column,1,4)
> Data$Date_column<- lubridate::ymd(Data$Date_column)
>
> Data %>% arrange(desc(Date_column)) %>%
+ distinct(ID_column,year,.keep_all = TRUE) %>%
+ arrange(Date_column)
ID_column Date_column Price_column year
1 A 2004-12-31 13 2004
2 A 2005-12-31 17 2005
3 B 2005-12-31 35 2005
4 A 2006-12-31 19 2006
5 B 2006-12-31 38 2006
6 B 2007-12-31 39 2007
since we arranged in the actual date in descending order, you guarantee that dropped rows for the unique combination of ID and year is the oldest. you can change the arrangement for the opposite; to get the oldest occuerence

Format function output as data frame

I use the following to sum several measures of TABR per Julian date in 5 separate years:
TABR_YearDay<-with(wsmr, tapply(TABR, list(Julian, Year),sum))
Which produces output like:
2015 2016 2017 2018 2019
33 NA NA NA 2 NA
....
80 NA 1 NA 21 NA
81 NA 47 NA 25 NA
82 NA 12 1 9 NA
But I want to convert these results into a dataframe with 6 columns Julian + 2015-2019.
I tried:
TABR_Day<-as.data.frame(TABR_YearDay)
But that seems to not produce a fully realized df: there is no column for Julian and if I want to call an individual variable, I have to use quotes around it like:
hist(TABR_Day$"2017")
Can you help me transition the function output to a dataframe with 6 viable columns?
The column is present in the rownames of the result. Column names usually don't start with a number, we can prepend the column name with the word 'Year' to make it 'Year_2015' etc and construct the final dataframe.
TABR_YearDay<-with(wsmr, tapply(TABR, list(Julian, Year),sum))
colnames(TABR_YearDay) <- paste0('Year_', colnames(TABR_YearDay))
TABR_Day <- data.frame(Julian = rownames(TABR_YearDay), TABR_YearDay)

Sum column values that match year in another column in R

I have the following dataframe
y<-data.frame(c(2007,2008,2009,2009,2010,2010),c(10,13,10,11,9,10),c(5,6,5,7,4,7))
colnames(y)<-c("year","a","b")
I want to have a final data.frame that adds together within the same year the values in "y$a" in the new "a" column and the values in "y$b" in the new "b" column so that it looks like this"
year a b
2007 10 5
2008 13 6
2009 21 12
2010 19 11
The following loop has done it for me,
years<- as.numeric(levels(factor(y$year)))
add.a<- numeric(length(y[,1]))
add.b<- numeric(length(y[,1]))
for(i in years){
ind<- which(y$year==i)
add.a[ind]<- sum(as.numeric(as.character(y[ind,"a"])))
add.b[ind]<- sum(as.numeric(as.character(y[ind,"b"])))
}
y.final<-data.frame(y$year,add.a,add.b)
colnames(y.final)<-c("year","a","b")
y.final<-subset(y.final,!duplicated(y.final$year))
but I just think there must be a faster command. Any ideas?
Kindest regards,
Marco
The aggregate function is a good choice for this sort of operation, type ?aggregate for more information about it.
aggregate(cbind(a,b) ~ year, data = y, sum)
# year a b
#1 2007 10 5
#2 2008 13 6
#3 2009 21 12
#4 2010 19 11

count unique values in one column for specific values in another column,

I have a data frame on bills that has (among other variables) a column for 'year', a column for 'issue', and a column for 'sub issue.' A simplified example df looks like this:
year issue sub issue
1970 4 20
1970 3 21
1970 4 22
1970 2 8
1971 5 31
1971 4 22
1971 9 10
1971 3 21
1971 4 22
Etc., for about 60 years. I want to count the unique values in the issue and sub issue columns for each year, and use those to create a new df- dat2. Using the df above, dat2 would look like this:
year issues sub issues
1970 3 4
1971 4 4
Weary of factors, I confirmed that the values in all columns are integers, if that makes a difference. I am new at R (obviously), and I haven't been able to find relevant code for this specific purpose online. Thanks for any help!!
That's a one-liner, with aggregate:
with(d,aggregate(cbind(issue,subissue) ~ year,FUN=function(x){length(unique(x))}))
returning:
year issue subissue
1 1970 3 4
2 1971 4 4

R Table data with a grouping command

This seems like a very simple problem, but I can't seem to sort it out. I have sought help from this forum, with the below topics being close, but don't seem to do exactly what I need. I have count data over several years. I want to obtain frequencies of the count value by year. It seems I need a table function with a grouping option, but I haven't found the proper syntax.
Data:
count year
1 15 1957
2 6 1957
3 23 1957
4 23 1957
5 2 1957
6 28 1980
7 15 1980
8 32 1980
9 18 1981
thank you in advance!
Counting the number of elements with the values of x in a vector
grouping data splitted by frequencies
Aggregate data in R
You're looking for the table function. Something like:
with(yourdata, table(Year, Count))

Resources