Frequency count per observation in R

I am trying to do a frequency count of a categorical variable (i.e., upper division class) per case in a dataset that is currently in long format. I am using R.
Current data set:
Student_ID   Class     UD_class
111          PSY 400   1
111          ENG 310   0
111          EE 510    1
I would like to convert it to a data frame that looks like this:
Student_ID   UD_class
111        2
I tried this code, but it gives me the wrong frequencies:
data.frame(table(data$Student_ID, data$UD_class))
Any suggestions on how I can do this in R? Thank you!

Try:
with(data[data$UD_class==1,], data.frame(table(Student_ID)))

Try as.data.frame instead of data.frame. To keep your column headings, use the with function: as.data.frame(with(data, table(Student_ID, UD_class)))
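If all you need is the per-student count, another option is to sum the 0/1 indicator directly with aggregate. A minimal sketch, with the sample data rebuilt from the question:

```r
# Rebuild the sample data from the question
data <- data.frame(
  Student_ID = c(111, 111, 111),
  Class      = c("PSY 400", "ENG 310", "EE 510"),
  UD_class   = c(1, 0, 1)
)

# Sum the 0/1 indicator per student to count upper-division classes
result <- aggregate(UD_class ~ Student_ID, data = data, FUN = sum)
result
#   Student_ID UD_class
# 1        111        2
```

Because UD_class is already a 0/1 flag, summing it per student is the same as counting the upper-division rows.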

Is there an R function for setting rows on aggregate data?

The data I am working with is from eBird, and I am looking to sort species occurrences by both name and year. There are over 30k individual observations, each with its own count of birds. In the raw data posted below, on Jan 1, 2021 someone observed 2 Cooper's Hawks, and so on.
Raw looks like this:
specificName    individualCount  eventDate    year
Cooper's Hawk   1                (1/1/2018)   2018
Cooper's Hawk   1                (1/1/2020)   2020
Cooper's Hawk   2                (1/1/2021)   2021
Ideally, I would be able to group all the Cooper's Hawk rows by specificName and the year they were observed, and sum the individualCount values. That way I can make statistical comparisons between the number of birds observed in 2018, 2019, 2020, and 2021.
I created the separate column for the year
year <- as.POSIXct(ebird.df$eventDate, format = "%m/%d/%Y")
ebird.df$year <- as.numeric(format(year, "%Y"))
Then I aggregated with the following:
aggdata <- aggregate(ebird.df$individualCount, by = list(ebird.df$specificName, ebird.df$year), FUN = sum)
There are hundreds of bird species, so Cooper's Hawks start on the 115th row so the output looks like this:
    Group.1  Group.2        x
115 2018     Cooper's Hawk  86
116 2019     Cooper's Hawk  152
117 2020     Cooper's Hawk  221
118 2021     Cooper's Hawk  116
My question is: how do I get the data into a table that looks like the following?
Species Name   2018 2019 2020 2021
Cooper's Hawk   86   152  221  116
I want to eventually run some basic ecology stats on the data using vegan, but one problem first I guess lol
Thanks!
There are errors in the data and code in the question, so we used the code and reproducible data given in the Note at the end.
Now, using xtabs we get an xtabs table directly from ebird.df like this. No packages are used.
xtabs(individualCount ~ specificName + year, ebird.df)
##                year
## specificName    2018 2020 2021
##   Cooper's Hawk    1    1    2
Optionally convert it to a data.frame:
xtabs(individualCount ~ specificName + year, ebird.df) |>
  as.data.frame.matrix()
##               2018 2020 2021
## Cooper's Hawk    1    1    2
Although we did not need aggdata, if you do need it for some other reason it can be computed using the formula method of aggregate like this:
aggregate(individualCount ~ specificName + year, ebird.df, sum)
Note
Lines <- "specificName,individualCount,eventDate,year
\"Cooper's Hawk\",1,(1/1/2018),2018
\"Cooper's Hawk\",1,(1/1/2020),2020
\"Cooper's Hawk\",2,(1/1/2021),2021"
ebird.df <- read.csv(text = Lines, strip.white = TRUE)
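If you prefer a plain data frame with the species as an ordinary column (as in the desired output), base R's reshape can also pivot the aggregate result from long to wide. A sketch using the reproducible data from the Note:

```r
# Reproducible data, as in the Note
Lines <- "specificName,individualCount,eventDate,year
\"Cooper's Hawk\",1,(1/1/2018),2018
\"Cooper's Hawk\",1,(1/1/2020),2020
\"Cooper's Hawk\",2,(1/1/2021),2021"
ebird.df <- read.csv(text = Lines, strip.white = TRUE)

# Long-format totals per species and year
aggdata <- aggregate(individualCount ~ specificName + year, ebird.df, sum)

# Pivot the years into columns; one row per species
wide <- reshape(aggdata, idvar = "specificName", timevar = "year",
                direction = "wide")
wide  # columns: specificName, individualCount.2018, individualCount.2020, ...
```

The year columns come out named individualCount.2018 and so on; rename them afterwards if you want bare years as headers.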

Fisher exact test in R & SAS: different p-values between the two programs

I have data like this (about credit rating and default):
credit rating   Normal   Default
1st grade       220      0
2nd grade       737      3
3rd grade       680      7
4th grade       73       3
5th grade       6        0
I took Fisher exact test in R.
First, I saved the data in a matrix named "chisq".
Second, I ran the Fisher exact test using this code:
fisher.test(chisq,hybrid=TRUE)
Then I got the p-value 0.02791.
But my colleague ran the same test in SAS and got a p-value of 0.0503.
I can't understand why the test results differ between R and SAS.
Please help.
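A likely cause is that the two programs are not running the same computation: in R, hybrid = TRUE approximates the exact test for tables larger than 2x2, while the default hybrid = FALSE computes the exact p-value, which is what SAS's FISHER option reports. A sketch of the setup described above (table values taken from the question):

```r
# Credit-rating table from the question: rows = grades, cols = Normal/Default
chisq <- matrix(c(220, 737, 680, 73, 6,
                  0,   3,   7,   3, 0),
                ncol = 2,
                dimnames = list(paste0("grade", 1:5), c("Normal", "Default")))

# hybrid = TRUE approximates the exact test for tables larger than 2x2
p_hybrid <- fisher.test(chisq, hybrid = TRUE)$p.value

# The default computes the exact p-value; extra workspace helps larger tables
p_exact <- fisher.test(chisq, workspace = 2e7)$p.value

c(hybrid = p_hybrid, exact = p_exact)
```

Comparing the two calls side by side should show whether the hybrid approximation accounts for the discrepancy.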

R markdown export diacritics

I am trying to export a summary of categorical variables through R Markdown. The output of the summary is in Czech with diacritics, but R isn't able to encode it.
Example
summary(data$Et6_d1q_ii)
3 a více hodin   1 - 2 hodiny   méně než 1 hodinu   žádný   NA's
           113            240                 196     111   6932
Is there any way to set the encoding globally so the output is readable? I wasn't able to find it anywhere.
Thank you!
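There is no single global option in base R, but two things usually help: save the .Rmd file itself as UTF-8 (in RStudio, File > Save with Encoding), and set a locale that can represent Czech characters before the summary is printed. A minimal sketch for a setup chunk; the locale names here are assumptions that vary by platform, so adjust as needed:

```r
# Pick a Czech-capable locale; names differ by OS (assumed values shown)
loc <- if (.Platform$OS.type == "windows") "Czech" else "cs_CZ.UTF-8"

# Sys.setlocale() returns "" if the requested locale is unavailable
result <- suppressWarnings(Sys.setlocale("LC_CTYPE", loc))
if (result == "") message("Locale not available: ", loc)
```

If the locale is unavailable on the machine rendering the document, the message above tells you so rather than silently mangling the output.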

Locating specific datapoints in R and merging matrices

I have two datasets, and I need to merge specific points from them into a third matrix which I will create.
I am trying to create a matrix with stock returns for all the companies in my dataset.
My dataset of the companies (referencedata) looks like this:
Company  PERMNO  earlengage
A        45643   6/7/2011
B        86743   9/12/2012
C        75423   3/4/2011
D        95345   2/11/2011
......
My dataset of the stock returns (datastock) looks like this:
PERMNO  date      returns
11456   1/3/2011  3.4%
11456   1/4/2011  5.4%
11456   1/5/2011  0.5%
11456   1/6/2011  1.2%
11456   1/7/2011  0.7%
......
I need to use the PERMNO code in referencedata as an identifier to locate the company I am looking for in datastock. At the same time, I need to use earlengage in referencedata to find the matching date in datastock and then select the 250 return datapoints prior to that day.
I want to put these 250 datapoints for each stock in one matrix (250 rows for the returns and n columns, one per stock).
I am struggling to replicate the equivalent of the VLOOKUP function in Excel. The output matrix would look like this:
PERMNO  date      returns
45643   1/3/2011  3.4%
45643   1/4/2011  5.4%
45643   1/5/2011  0.5%
......
45643   6/7/2011  1.2%   (this is the earlengage date)
Any help would be much appreciated.
The way I see it, you are trying to solve two problems in one shot: the first is merging, and the other is taking the last 250 data points and converting them into a matrix. I'd approach this in the simplest way possible by going through the rows one by one rather than trying to solve it with a single function.
# Parse the dates first so sorting and comparison are chronological,
# not alphabetical
datastock$date <- as.Date(datastock$date, format = "%m/%d/%Y")
referencedata$earlengage <- as.Date(referencedata$earlengage,
                                    format = "%m/%d/%Y")

# Sorting so that we can take the bottom 250 rows to find the latest data
datastock <- datastock[order(datastock$date), ]
dataMatrix <- NULL
for (i in 1:nrow(referencedata))
{
  single_stock_data <- subset(datastock, PERMNO == referencedata$PERMNO[i] &
                                         date < referencedata$earlengage[i])
  dataMatrix <- cbind(dataMatrix, tail(single_stock_data$returns, 250))
}
I haven't tested the code but this should work.
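To illustrate the shape of the result, here is a self-contained sketch of the same approach on toy data (two stocks, five days each; column names follow the question, and n_prior stands in for the 250 used with the real data):

```r
# Toy reference data: one engagement date per stock
referencedata <- data.frame(
  PERMNO     = c(1, 2),
  earlengage = as.Date(c("2011-01-05", "2011-01-06"))
)

# Toy return history: five daily returns per stock
datastock <- data.frame(
  PERMNO  = rep(c(1, 2), each = 5),
  date    = rep(as.Date("2011-01-01") + 0:4, times = 2),
  returns = c(0.034,  0.054, 0.005, 0.012, 0.007,
              0.021, -0.003, 0.010, 0.008, 0.015)
)

n_prior <- 3  # use 250 with the real data

# Sort chronologically, then take the last n_prior returns before
# each stock's engagement date
datastock <- datastock[order(datastock$date), ]
dataMatrix <- NULL
for (i in 1:nrow(referencedata)) {
  rows <- subset(datastock,
                 PERMNO == referencedata$PERMNO[i] &
                 date < referencedata$earlengage[i])
  dataMatrix <- cbind(dataMatrix, tail(rows$returns, n_prior))
}

dim(dataMatrix)  # n_prior rows (returns) x 2 columns (stocks)
```

Note that cbind will recycle or fail if a stock has fewer than n_prior prior observations, so with real data it is worth checking nrow(rows) inside the loop.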

r time series to panel by group and month

So I have a dataframe with about 500,000 obs that looks like this:
ID   MonthYear  Group
123  200811     Blue
345  201102     Red
678  201110     Blue
910  201303     Green
I would like to convert this to a panel that counts the number of occurrences for each group in each month. So it would look like this:
MonthYear  Group  Count
200801     Blue   521
200802     Blue   400
....
200801     Red    521
200802     Red    600
....
I guess it doesn't need to look exactly like that, but just some way to turn this into a useful panel. Aggregate doesn't seem to be sufficient in and of itself.
aggregate(dfrm$ID, dfrm[,c("MonthYear","Group")], length)
If you want to reverse the grouping, just reverse the order of the INDEX argument.
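A runnable sketch of that call on toy data (the counts are small here, but the same call scales to the 500,000-row data frame):

```r
# Toy version of the data frame from the question
dfrm <- data.frame(
  ID        = 1:6,
  MonthYear = c(200801, 200801, 200801, 200802, 200802, 200801),
  Group     = c("Blue", "Blue", "Red", "Blue", "Red", "Red")
)

# Count occurrences per MonthYear x Group; length() counts the IDs
counts <- aggregate(dfrm$ID, dfrm[, c("MonthYear", "Group")], length)
names(counts)[3] <- "Count"  # the count column is named "x" by default
counts
```

Each row of the result is one MonthYear/Group combination with its occurrence count, which is exactly the panel shape asked for.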
