Aggregate columns in data.table for descriptive statistics

I am looking at a student data set at the individual student level.
What I want to do is some descriptive analysis at the faculty/degree level.
Some students are doing double degrees (e.g. Bachelor of IT and Bachelor of Science), so a single student can generate two degree records.
My data looks something like the table below; the assignment of a faculty to FAC1 or FAC2 is arbitrary.
studid FAC1     FAC2    SUCCESS SEX    AVE_MARK
1      IT       ARTS    0       Male   65
2      SCIENCE          1       Male   35
3      LAW              0       Male   98
4      IT       SCIENCE 0       Female 55
5      COMMERCE IT      0       Female 20
6      COMMERCE IT      1       Male   80
This was generated with
library(data.table)
students <- data.table(studid = 1:6,
                       FAC1 = c("IT", "SCIENCE", "LAW", "IT", "COMMERCE", "COMMERCE"),
                       FAC2 = c("ARTS", "", "", "SCIENCE", "IT", "IT"),
                       SUCCESS = c(0, 1, 0, 0, 0, 1),
                       SEX = c("Male", "Male", "Male", "Female", "Female", "Male"),
                       AVE_MARK = c(65, 35, 98, 55, 20, 80))
How would I go about producing something like the table below (made-up figures), i.e. creating a FACULTY column that incorporates both the FAC1 and FAC2 columns? I have been trying to use lapply across FAC1 and FAC2 but keep hitting dead ends, e.g. students[, lapply(.SD, mean), by = agg.by, .SDcols = c('FAC1', 'FAC2')].
FACULTY  MEAN_SUCCESS AVE_MARK
IT       0.65         65
SCIENCE  1            50
LAW      0.76         50
ARTS     0.55         50
COMMERCE 0.40         10
Any assistance would be greatly appreciated.

This seems like what you are looking for.
library(reshape2)
DT <- melt(students, measure.vars = c("FAC1", "FAC2"), value.name = "FACULTY")[nchar(FACULTY) > 0]
DT[, list(mean_success = mean(SUCCESS), ave_mark = mean(AVE_MARK)), by = FACULTY]
# FACULTY mean_success ave_mark
# 1: IT 0.25 55
# 2: SCIENCE 0.50 45
# 3: LAW 0.00 98
# 4: COMMERCE 0.50 50
# 5: ARTS 0.00 65
So this uses melt(...) from the reshape2 package to collapse the two faculty columns into one, replicating all the other columns. Unfortunately, this produces some rows with a blank faculty, so we get rid of those with [nchar(FACULTY) > 0]. Then it's simple to aggregate on the (new) FACULTY column.
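As a side note, newer versions of data.table ship their own melt() method, so the same idea works without attaching reshape2 (a sketch, assuming data.table 1.9.6 or later):
library(data.table)
# data.table's own melt(): stack FAC1/FAC2 into a single FACULTY column,
# drop the blank faculties, then aggregate by faculty
long <- melt(students, measure.vars = c("FAC1", "FAC2"),
             value.name = "FACULTY")[FACULTY != ""]
long[, .(mean_success = mean(SUCCESS), ave_mark = mean(AVE_MARK)), by = FACULTY]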

Related

Nested logit model using panel data in R

I am new to R and I would love it if you could help me with this, because I am having serious difficulties.
I have unbalanced panel data that shows companies' monthly performance relative to the rest of the market in dollars (e.g. this month company 1 made $1000 more than the market average). Each company decided on a strategy (1 through 8) when it entered the market. These strategies are nested in two groups (a, b): strategies 1, 2 and 3 belong to group a, while strategies 4 through 8 belong to group b. I need a ranking of the strategies from best to worst.
I have discretized my DV so that it now only shows whether a company performed higher or lower than the market in a given month. However, I am not sure this is the right way, because I then lose the information about how much better or worse each company performed relative to the market each month.
My data looks like this:
ID Main Strategy YearMonth DiffPerformance Control1 Control2 DiffPerformanceHL
1  a    2        201706    9.037           2        57       H
1  a    2        201707    4.371           2        57       H
1  a    2        201708    1.633           2        57       H
1  a    2        201709    -3.521          2        59       L
1  a    2        201710    13.096          2        59       H
1  a    2        201711    5.070           2        60       H
1  a    2        201712    4.25            2        60       H
2  b    5        201904    6.78            4        171      H
2  b    5        201905    -15.26          4        169      L
2  b    5        201906    7.985           4        169      H
Here ID is the company, Main is the group (a or b), Strategy runs from 1 through 8 and is nested as described above, YearMonth identifies the month, DiffPerformance is the DV as a continuous variable, Control1 is a categorical variable (1 through 6) that is static over time, Control2 is a count control variable that changes over time, and DiffPerformanceHL is the discretized DV.
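For concreteness, the H/L coding described above appears to be just the sign of DiffPerformance; a minimal sketch of that step (using a hypothetical data frame called panel that holds the columns above) would be:
# H if the company beat the market average that month, L otherwise
panel$DiffPerformanceHL <- ifelse(panel$DiffPerformance > 0, "H", "L")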
Can you please help me figure out how to create a nested logit model in R? I would be super appreciative.
Thanks

Remove NA's from a stacked bar chart created using likertplot function from the HH package

I am creating stacked bar charts using the likertplot function from the HH package to display summary results from a recent student survey.
The code I have used to produce this plot is:
library(HH)
likertplot(Subgroup ~ . | Group, data = SOCIETIES_DATA,
           as.percent = TRUE,
           main = 'Did you attend the City Societies Fair?',
           ylab = NULL,
           scales = list(y = list(relation = "free")),
           between = list(y = 0),
           layout = c(1, 5))
Here SOCIETIES_DATA is my data frame containing frequency data for the number of students from particular demographics that selected each answer to a single question (in this case, whether they attended the societies fair). Group is a column holding the names of the demographic categories (e.g. Age, Accommodation) and Subgroup holds the categories within each group (e.g. for Age: <18, 18-20, 21-24, etc.).
Unfortunately I am getting unwanted NA values on the second y-axis of the chart for particular groups (in my example, Age and Fee Status).
[Figure: the likert plot produced by the code above, showing the unwanted NA categories]
My data is formatted the same way as other data I have used to create likertplots without any issues, so the problem is unlikely to come from the data itself and more likely from the likertplot call.
Most likely the problem is in the scales = argument, since editing it changes the number of NA levels shown in each panel of the stacked bar chart.
I have read through the documentation for the likertplot function in the HH package as well as Heiberger and Robbins (2014) Design of Diverging Stacked Bar Charts for Likert Scales and Other Applications, but have found no solutions to this issue.
The data I have used is presented below.
Did not attend Yes and poor range of stalls Yes and good range of stalls Subgroup Group
1 107 23 155 Halls Accommodation
2 81 7 54 Home Accommodation
3 10 2 5 Prefer not to answer Accommodation
4 71 13 90 Rented private accommodation Accommodation
5 9 1 4 <18 Age
6 192 33 220 18-20 Age
7 37 6 64 21-24 Age
8 27 4 17 25-39 Age
9 6 1 1 40 and over Age
10 2 0 1 Prefer not to answer Age
11 29 6 57 EU Fee Status
12 195 31 198 Home Fee Status
13 34 8 43 International Fee Status
14 15 0 9 Prefer not to answer Fee Status
15 48 10 59 Arts, Design and Social Sciences Faculty
16 75 10 86 Business and Law Faculty
17 34 12 64 Engineering and Environment Faculty
18 53 8 59 Health and Life Sciences - City Campus Faculty
19 59 5 36 Health and Life Sciences - Coach Lane Campus Faculty
20 52 6 61 Foundation Study Mode
21 1 1 1 Postgraduate Research Study Mode
22 13 2 18 Postgraduate Taught Study Mode
23 207 36 227 Undergraduate Study Mode
Any help would be greatly appreciated.
I was able to solve this myself and the answer was actually pretty simple: the subgroup categories must be unique across groups. I had the option 'Prefer not to answer' under both Age and Fee Status, which was causing the unwanted NA level.
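For anyone hitting the same thing, a minimal sketch of that fix (not the exact code used, and assuming Subgroup may be a factor) is to disambiguate any label shared between groups before plotting:
library(HH)

# Make Subgroup labels unique across Groups, so a label such as
# "Prefer not to answer" is not shared between Age and Fee Status
SOCIETIES_DATA$Subgroup <- as.character(SOCIETIES_DATA$Subgroup)
dup <- SOCIETIES_DATA$Subgroup %in% SOCIETIES_DATA$Subgroup[duplicated(SOCIETIES_DATA$Subgroup)]
SOCIETIES_DATA$Subgroup[dup] <- paste(SOCIETIES_DATA$Subgroup[dup],
                                      SOCIETIES_DATA$Group[dup], sep = " - ")

likertplot(Subgroup ~ . | Group, data = SOCIETIES_DATA,
           as.percent = TRUE,
           main = 'Did you attend the City Societies Fair?',
           ylab = NULL,
           scales = list(y = list(relation = "free")),
           between = list(y = 0),
           layout = c(1, 5))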

R average over ages when some ages missing

I have a data.table with columns for age, food category, and the kcal consumed. I'm trying to get the average kcal for each category, but for some ages there is no consumption in a given category, so I can't take a simple average: the implicit zeroes aren't rows in the data.table.
So for the example data:
library(data.table)
dtp2 <- data.table(age = c(4, 4, 4, 5, 6, 18),
                   category = c("chips", "vegetables", "pizza", "chips", "pizza", "beer"),
                   kcal = c(100, 5, 100, 120, 100, 150))
just doing dtp2[, mean(kcal), by = category] gives the wrong answer, because only the 18-year-olds are consuming beer and the 4-17-year-olds aren't.
The actual data set covers 4-18 year olds with many, many categories. I've tried populating the data.table with zeroes for the omitted ages using a nested for loop, which is very slow, and then taking the means as above.
Is there a sensible R way of taking the mean kcal where missing values are assumed to be zero, without nested for loops putting in the zeroes?
I take it you want to include the missing/zero kcal values in the calculation. Instead of taking the average directly, you could just sum by category and divide by the total number of ages.
The suggestion by Mr. Bugle is rather generic and doesn't show any code. Picking this up, the code of the OP needs to be modified as follows:
library(data.table)
dtp2[, sum(kcal) / uniqueN(dtp2$age), by = category]
which returns
category V1
1: chips 55.00
2: vegetables 1.25
3: pizza 50.00
4: beer 37.50
Note that uniqueN(dtp2$age) is used rather than just uniqueN(age): grouped by category, the latter would count only the ages actually present in that category, not all ages in the data.
However, there are situations where the missing values really do need to be filled in with zeroes, without nested for loops putting in the zeroes, as the OP has asked.
One such situation arises when the data is reshaped from long to wide format for presentation:
reshape2::dcast(dtp2, age ~ category, fun = mean, value.var = "kcal", margins = TRUE)
age beer chips pizza vegetables (all)
1 4 NaN 100 100 5 68.33333
2 5 NaN 120 NaN NaN 120.00000
3 6 NaN NaN 100 NaN 100.00000
4 18 150 NaN NaN NaN 150.00000
5 (all) 150 110 100 5 95.83333
Here, the margin means are computed only from the available data, which is not what the OP asked for. (Note that the parameter fill = 0 has no effect on the computation of the margins.)
So the missing values need to be filled in before reshaping. In base R, expand.grid() can be used for this purpose; in data.table it's the cross join function CJ():
expanded <- dtp2[CJ(age, category, unique = TRUE), on = .(age = V1, category = V2)
][is.na(kcal), kcal := 0][]
expanded
age category kcal
1: 4 beer 0
2: 4 chips 100
3: 4 pizza 100
4: 4 vegetables 5
5: 5 beer 0
6: 5 chips 120
7: 5 pizza 0
8: 5 vegetables 0
9: 6 beer 0
10: 6 chips 0
11: 6 pizza 100
12: 6 vegetables 0
13: 18 beer 150
14: 18 chips 0
15: 18 pizza 0
16: 18 vegetables 0
Now, reshaping from long to wide returns the expected results:
reshape2::dcast(expanded, age ~ category, fun = mean, value.var = "kcal", margins = TRUE)
age beer chips pizza vegetables (all)
1 4 0.0 100 100 5.00 51.2500
2 5 0.0 120 0 0.00 30.0000
3 6 0.0 0 100 0.00 25.0000
4 18 150.0 0 0 0.00 37.5000
5 (all) 37.5 55 50 1.25 35.9375
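For completeness, here is a sketch of the base R route mentioned above, using expand.grid() plus a merge to do the same zero-filling:
# Build the full age x category grid, merge the observed kcal values onto it,
# and treat the combinations that never occur as zero
full <- expand.grid(age = unique(dtp2$age), category = unique(dtp2$category),
                    stringsAsFactors = FALSE)
expanded_base <- merge(full, dtp2, by = c("age", "category"), all.x = TRUE)
expanded_base$kcal[is.na(expanded_base$kcal)] <- 0

# Per-category means now include the zero rows
aggregate(kcal ~ category, data = expanded_base, FUN = mean)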

Subtracting Values in Previous Rows: Ecological Lifetable Construction

I was hoping I could get some help. I am constructing a life table, not for insurance but for ecology (a cross-sectional sample of a population of wild fauna), so essentially without the censoring variables you would see in a human table (smoker/non-smoker, pregnant, gender, health status, etc.):
AgeClass <- c(1, 2, 3, 4, 5, 6)
Sample <- c(100, 99, 87, 46, 32, 19)
for (i in 1:6) {
  PropSurv <- c(Sample / 100)
}
LifeTab1 <- data.frame(cbind(AgeClass, Sample, PropSurv))
Which gave me this:
ID AgeClass Sample PropSurv
 1        1    100     1.00
 2        2     99     0.99
 3        3     87     0.87
 4        4     46     0.46
 5        5     32     0.32
 6        6     19     0.19
I'm now trying to calculate the number that died in each interval (DeathInt) by taking the number surviving in each row and subtracting the number in the row below it (i.e. 100-99, then 99-87, then 87-46, and so forth), so that it looks like this:
ID AgeClass Sample PropSurv DeathInt
 1        1    100     1.00        1
 2        2     99     0.99       12
 3        3     87     0.87       41
 4        4     46     0.46       14
 5        5     32     0.32       13
 6        6     19     0.19       NA
I found this and this, and I wasn't sure whether they answered my question, as those answers subtracted values based on groups; I just want to subtract values row by row.
Also, just as a side note: I used a for() loop to get the proportion that survived in each age group. I was wondering if there is another way to do it, or if that's the proper, easiest way to do it.
Second note: If any R-users out there know of an easier way to do a life-table for ecology, do let me know!
Thanks!
If you have a vector x that contains numbers, you can calculate the successive differences using the diff() function.
In your case it would be
LifeTab1$DeathInt <- c(-diff(Sample), NA)
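As for the side note about the for() loop: it isn't needed at all, since R is vectorised. A minimal sketch of the whole table (dividing by the first age class rather than hard-coding 100):
AgeClass <- 1:6
Sample <- c(100, 99, 87, 46, 32, 19)

# Proportion surviving relative to the first age class (no loop required)
PropSurv <- Sample / Sample[1]

# Deaths in each interval: the drop to the next age class, NA for the last
DeathInt <- c(-diff(Sample), NA)

LifeTab1 <- data.frame(AgeClass, Sample, PropSurv, DeathInt)
LifeTab1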

sentiment analysis with different number of documents

I am trying to do sentiment analysis on newspaper articles and track the sentiment level over time. To do that, I basically identify all the relevant news articles within a day, feed them into the polarity() function, and obtain the average polarity score of all the articles (more precisely, the average over all the sentences from all the articles) for that day.
The problem is that some days have many more articles than others, and I think this might mask some information if we simply track the daily average polarity score. For example, a score of 0.1 from 30 news articles should carry more weight than a score of 0.1 generated from only 3 articles. And sure enough, some of the more extreme polarity scores I obtained came from days with only a few relevant articles.
Is there anyway I can take the different number of articles each day into consideration?
library(qdap)
sentence = c("this is good","this is not good")
polarity(sentence)
I would warn that sometimes saying something strong with few words may pack the most punch, so make sure what you're doing makes sense in terms of your data and research questions.
One approach would be to weight by the number of words, as in the following example (I like the first approach more here):
library(qdap)  # mraja1spl (Romeo and Juliet dialogue) ships with qdap

poldat2 <- with(mraja1spl, polarity(dialogue, list(sex, fam.aff, died)))
output <- scores(poldat2)

# Weight 1: log-dampened word counts mapped to roughly [-1, 1], then normalised by the maximum
weight <- ((1 - (1/(1 + log(output[["total.words"]], base = exp(2))))) * 2) - 1
weight <- weight/max(weight)

# Weight 2: word counts as a simple proportion of the maximum
weight2 <- output[["total.words"]]/max(output[["total.words"]])

output[["weighted.polarity"]] <- output[["ave.polarity"]] * weight
output[["weighted.polarity2"]] <- output[["ave.polarity"]] * weight2
output[, -c(5:6)]
## sex&fam.aff&died total.sentences total.words ave.polarity weighted.polarity weighted.polarity2
## 1 f.cap.FALSE 158 1641 0.083 0.143583793 0.082504197
## 2 f.cap.TRUE 24 206 0.044 0.060969157 0.005564434
## 3 f.mont.TRUE 4 29 0.079 0.060996614 0.001397106
## 4 m.cap.FALSE 73 651 0.031 0.049163984 0.012191207
## 5 m.cap.TRUE 17 160 -0.176 -0.231357933 -0.017135804
## 6 m.escal.FALSE 9 170 -0.164 -0.218126656 -0.016977931
## 7 m.escal.TRUE 27 590 -0.067 -0.106080866 -0.024092720
## 8 m.mont.FALSE 70 868 -0.047 -0.078139272 -0.025099276
## 9 m.mont.TRUE 114 1175 -0.002 -0.003389105 -0.001433481
## 10 m.none.FALSE 7 71 0.066 0.072409049 0.002862997
## 11 none.none.FALSE 5 16 -0.300 -0.147087026 -0.002925046
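If the aim is simply to let days with more text count for more, another option (not part of the answer above, just a sketch with made-up articles) is to weight the sentence-level polarity scores by their word counts using weighted.mean(); the all element of a polarity object holds one row per sentence with its word count (wc) and polarity:
library(qdap)

# Hypothetical articles for a single day
articles <- c("markets rallied strongly today",
              "profits fell sharply on weak demand",
              "the overall outlook is not good")

pol <- polarity(articles)

# Weight each sentence's polarity by its word count, so longer coverage
# has proportionally more influence on the daily score
weighted.mean(pol$all$polarity, w = pol$all$wc)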
