create list and generate descriptives for each variable - r

I want to generate descriptive statistics for multiple variables at a time (close to 50), rather than writing out the code several times.
Here is a very basic example of data:
id var1 var2
1 1 3
2 2 3
3 1 4
4 2 4
I typically write out each line of code to get a frequency count and descriptives, like so:
library(psych)
table(df1$var1)
table(df1$var2)
describe(df1$var1)
describe(df1$var2)
I would like to create a list and get the output from these analyses, rather than writing out 100 lines of code. I tried this, but it is not working:
variable_list<-list(df1$var, df2$var)
for (variable in variable_list){
table(df$variable_list))
describe(df$variable_list))}
Does anyone have advice on getting this to work?

describe() from psych accepts a data.frame and returns the descriptive statistics for each column:
library(psych)
describe(df1)
# vars n mean sd median trimmed mad min max range skew kurtosis se
#id 1 4 2.5 1.29 2.5 2.5 1.48 1 4 3 0 -2.08 0.65
#var1 2 4 1.5 0.58 1.5 1.5 0.74 1 2 1 0 -2.44 0.29
#var2 3 4 3.5 0.58 3.5 3.5 0.74 3 4 1 0 -2.44 0.29
If you only need a subset of the columns, select them by column index or column name before calling describe():
describe(df1[2:3])
Another option is descr from collapse
library(collapse)
descr(slt(df1, 2:3))
Or to select numeric columns
descr(num_vars(df1))
Or for factors
descr(fact_vars(df1))
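For the frequency counts, and for the loop in the question, a minimal base R sketch follows. The loop in the question fails for two reasons: `$` does not accept a variable holding a column name (use `df1[[v]]` instead), and results inside a `for` loop are not printed unless you call `print()`. The `vars` vector is a hypothetical selection of columns, and `summary()` stands in for `psych::describe()` so that the sketch runs without extra packages.

```r
# Example data from the question
df1 <- data.frame(id = 1:4, var1 = c(1, 2, 1, 2), var2 = c(3, 3, 4, 4))

# Hypothetical selection: the columns to summarise
vars <- c("var1", "var2")

# One frequency table and one summary per column, stored in named lists
freq_tables <- lapply(df1[vars], table)
summaries   <- lapply(df1[vars], summary)

# Loop version: index columns with [[ ]] and print explicitly
for (v in vars) {
  print(table(df1[[v]]))
}
```

The lists can then be inspected by name, for example `freq_tables$var1`.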

Related

How to sort a vector in R without repeating ranks

Good afternoon,
My question may seem very elementary, but I'm having trouble with it.
Assume we have the following vector:
x=c(0.75,0.75,1,1,0.5,0.5,0.5,0.25,0.25)
I want to sort the vector in decreasing order and then get the indices, which means:
sort.int(x, index.return=TRUE,decreasing=TRUE)
$x
[1] 1.00 1.00 0.75 0.75 0.50 0.50 0.50 0.25 0.25
$ix
[1] 3 4 1 2 5 6 7 8 9
However, the expected output should be :
y=c(2,2,1,1,3,3,3,4,4)
This means :
1 is the highest value ----- > 1
0.75 is the second highest value ----- > 2
0.5 is the third ----- > 3
0.25 is the lowest value -----> 4
I also tried :
x=c(0.75,0.75,1,1,0.5,0.5,0.5,0.25,0.25)
order(unique(sort(x)))
sort(unique(x),decreasing=TRUE)
[1] 1 2 3 4
[1] 1.00 0.75 0.50 0.25
But I don't know how to subset from x to get the expected output y.
Thank you for help !
sort will sort all the values, and use each value once. It seems like you want to ignore the indices of duplicated values after the first. We can use match for this, which will always return the index of the first match.
match(sort.int(x, decreasing = TRUE), unique(x))
# [1] 2 2 1 1 3 3 3 4 4
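The call above reproduces y for this particular x. If the goal is a rank for each element in its original position (1 for the highest distinct value, 2 for the next, and so on), a sketch that does not depend on the order in which values first appear is to match x against its sorted unique values. This is an alternative reading of the question rather than the method above, and x2 is a made-up vector used only for illustration.

```r
x <- c(0.75, 0.75, 1, 1, 0.5, 0.5, 0.5, 0.25, 0.25)

# Distinct values from highest to lowest: 1.00 0.75 0.50 0.25
levels_desc <- sort(unique(x), decreasing = TRUE)

# Position of each element of x among the distinct values
y <- match(x, levels_desc)
y
# [1] 2 2 1 1 3 3 3 4 4

# Hypothetical vector whose values do not first appear in sorted order
x2 <- c(0.75, 0.5, 1)
y2 <- match(x2, sort(unique(x2), decreasing = TRUE))
y2
# [1] 2 3 1
```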

How to generate summary statistics (similar to psych::describeBy) for subgroups of subgroups, within a larger dataset?

New to R (for biostats) here! I have a huge dataset, and am using describe() and describeBy() from the psych package. But I'm also trying to find a way to do basic stats for subgroups within subgroups.
For example, my dataset is about membership within a club, which has Chinese and Indian members. Other variables include gender, age, height, weight, BMI, etcetera.
I have figured out psych::describeBy to look at means and standard deviation for subgroups defined by one variable, e.g. ethnicity, but I can't figure out how to narrow this down further so that I generate a summary only for Chinese male members.
I tried redefining using the subset() function, and then running describeBy again, e.g.
chinese <- subset(maindata, chinese=1)
describeBy(chinese, male=1)
But this didn't work, and the results were the same as describeBy(maindata,chinese=1), rather than the Chinese male subset.
I hope that makes sense.
The only other solutions I can think of are to break my main dataset down into smaller ones in MS Excel and re-upload each separately (e.g. Chinese.xls, Indian.xls), or to create a new variable defined by a combination of ethnicity and gender, e.g. Chinesemale=1, Chinesefemale=2, Indianmale=3, Indianfemale=4.
I more or less will need to analyse by these subgroups of subgroups for t-tests and Fisher's exact, so any good package recommendations that would help address these would be appreciated!
Thanks in advance!!
Sample Data
df1 <- data.frame(subject = c(1, 2, 3, 4, 5),
chinese = c(1, 1, 1, 0, 0),
male = c(1, 0, 1, 0, 1),
value = c(45, 23, 84, 11, 12))
Two changes in syntax from your code:
Use a double equal sign in subset(). You want to keep rows where chinese is equal to 1, which is a comparison. A single equal sign is read as passing a value of 1 to an argument called chinese, so no rows are filtered out.
In describeBy(), the group parameter gives you separate summary statistics for each category in that column (as shown below). You can't use it to subset for male=1.
chinese <- subset(df1, chinese == 1)
describeBy(chinese, group = "male")
Descriptive statistics by group
group: 0
vars n mean sd median trimmed mad min max range skew kurtosis se
subject 1 1 2 NA 2 2 0 2 2 0 NA NA NA
chinese 2 1 1 NA 1 1 0 1 1 0 NA NA NA
male 3 1 0 NA 0 0 0 0 0 0 NA NA NA
value 4 1 23 NA 23 23 0 23 23 0 NA NA NA
-------------------------------------------------------------------------------------------------------------------------------------
group: 1
vars n mean sd median trimmed mad min max range skew kurtosis se
subject 1 2 2.0 1.41 2.0 2.0 1.48 1 3 2 0 -2.75 1.0
chinese 2 2 1.0 0.00 1.0 1.0 0.00 1 1 0 NaN NaN 0.0
male 3 2 1.0 0.00 1.0 1.0 0.00 1 1 0 NaN NaN 0.0
value 4 2 64.5 27.58 64.5 64.5 28.91 45 84 39 0 -2.75 19.5
If you only want to see the summary stats for Chinese males in the sample, you could add & male == 1 to the subset() and then call describe(), since no grouping is needed:
chinese <- subset(df1, chinese == 1 & male == 1)
describe(chinese)
vars n mean sd median trimmed mad min max range skew kurtosis se
subject 1 2 2.0 1.41 2.0 2.0 1.48 1 3 2 0 -2.75 1.0
chinese 2 2 1.0 0.00 1.0 1.0 0.00 1 1 0 NaN NaN 0.0
male 3 2 1.0 0.00 1.0 1.0 0.00 1 1 0 NaN NaN 0.0
value 4 2 64.5 27.58 64.5 64.5 28.91 45 84 39 0 -2.75 19.5
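To get every ethnicity-by-gender subgroup in one call instead of subsetting one combination at a time, one option is base R's aggregate() with both columns on the right-hand side of the formula. This is a sketch on the sample data above. It reports only the mean and the count, so it is a lighter stand-in for the full describeBy() output rather than a replacement for it.

```r
df1 <- data.frame(subject = c(1, 2, 3, 4, 5),
                  chinese = c(1, 1, 1, 0, 0),
                  male    = c(1, 0, 1, 0, 1),
                  value   = c(45, 23, 84, 11, 12))

# One row per chinese x male combination that occurs in the data
group_means <- aggregate(value ~ chinese + male, data = df1, FUN = mean)
group_n     <- aggregate(value ~ chinese + male, data = df1, FUN = length)

# Mean of value for Chinese males (rows 1 and 3: 45 and 84)
cm_mean <- group_means$value[group_means$chinese == 1 & group_means$male == 1]
cm_mean
# [1] 64.5
```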

Countifs in SAS

I have a SAS dataset with 3 columns: FirmIndex, ProductIndex and a third column called PrChange. Within each FirmIndex & ProductIndex group I want to count how many PrChange values are different from missing (.) and from 0, and add that count to a column called Number. Then I want to divide Number by the number of observations within each group which are not missing, giving a column called Fract.
Below an example of the dataset and desired output.
data prod;
input firmindex productindex PrChange Number Fract;
cards;
1 1 . 1 0.250
1 1 0.00 1 0.250
1 1 0.00 1 0.250
1 1 -0.40 1 0.250
1 1 0.00 1 0.250
1 2 . 2 1.000
1 2 1.00 2 1.000
1 2 0.30 2 1.000
1 3 . 4 0.800
1 3 0.70 4 0.800
1 3 1.00 4 0.800
1 3 0.70 4 0.800
1 3 0.00 4 0.800
1 3 -0.30 4 0.800
1 4 . 5 1.000
1 4 0.20 5 1.000
1 4 -1.00 5 1.000
1 4 -0.90 5 1.000
1 4 -0.50 5 1.000
1 4 1.00 5 1.000
2 1 . 2 1.000
2 1 0.30 2 1.000
2 1 -0.50 2 1.000
2 2 . 5 0.714
2 2 0.30 5 0.714
2 2 0.10 5 0.714
2 2 0.00 5 0.714
2 2 0.00 5 0.714
2 2 0.80 5 0.714
2 2 -0.20 5 0.714
2 2 0.40 5 0.714
2 3 . 1 1.000
2 3 0.60 1 1.000
2 4 . 5 0.714
2 4 -1.00 5 0.714
2 4 0.80 5 0.714
2 4 -0.20 5 0.714
2 4 0.00 5 0.714
2 4 0.00 5 0.714
2 4 -0.70 5 0.714
2 4 0.90 5 0.714
2 5 . 3 1.000
2 5 0.90 3 1.000
2 5 -0.70 3 1.000
2 5 -0.50 3 1.000
;
run;
Here is what I tried to generate the column number, but it is not working:
data work.prod;
set work.prod;
by firmindex productindex;
if first.productindex or first.firmindex then sum = 0;
else if PrChange ne 0 and PrChange ne .;
sum = sum + 1;
run;
Your problem here is that you need the number to divide by before you process the rows of data. This is where SAS differs from Excel: SAS is row-based, meaning it runs your code against each row of data (more or less) one at a time, rather than dynamically looking at every cell from every other cell as Excel does. That is much faster and more efficient, but less flexible for things like this.
Your particular question calls for a DoW loop. This takes over the normal data step loop and performs its own loop, twice: once to calculate the number/fract values, then once to copy those to every row of the BY group. Note that I only check last.productIndex; first/last flags are always set on a later BY variable whenever they are set on an earlier one.
Here the first loop runs over the first set of values (the first 5 records), then the second loop re-reads the same 5 records. Then the same happens for the next 3, and so on. Both loops always read the same number of rows, so they stay in sync.
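data want;
  /* First pass over the BY group: accumulate the counts */
  do _n_ = 1 by 1 until (last.productIndex);
    set have;
    by firmindex productindex;
    /* denominator: nonmissing PrChange values */
    number_denom = sum(number_denom, not missing(PrChange));
    /* numerator: PrChange values that are neither missing nor 0 */
    number = sum(number, not (PrChange in (., 0)));
  end;
  fract = number/number_denom;
  /* Second pass over the same rows: output each one with the group totals */
  do _n_ = 1 by 1 until (last.productIndex);
    set have;
    by firmindex productindex;
    output;
  end;
run;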
I'm going to give the IML answer that I'm able to give. Rick or someone else more IML-savvy probably can do better than this. In R or other matrix language I think this would be much easier, but I don't have the IML chops to do this without looping; maybe it's possible.
proc iml;
use have;
read all var _all_ into h;
u = h[uniqueby(h,1:2), 1:2]; *generate the "unique" categories for the first two columns;
v = j(nrow(h),5); *generate a matrix to save this into;
v[,1:3] = h; *start it out with the first three columns of the dataset;
do i = 1 to nrow(u); *iterate over the unique category matrix;
number = ncol(loc(h[loc((h[,1:2] = u[i,1:2])[,#]),3]));
*the inner LOC produces a two column 1/0 matrix with match 1 / nomatch 0 for each col
then reduce to 1 column via subscript reduction product, to get correct 1/0 match vector
the outer LOC takes the rows of h from that (so rows of h matching u), then returns nonzero/nonmissing
which then ncol summarizes into a count;
fract_denom = ncol(loc(h[loc((h[,1:2] = u[i,1:2])[,#]),3] ^= .));
*similar, but here we have to verify they are not missing explicitly, considering 0 valid;
v[loc((v[,1:2] = u[i,1:2])[,#]),4] = number; *assign to col4 of V;
v[loc((v[,1:2] = u[i,1:2])[,#]),5] = number/fract_denom; *assign to col5 of V;
end;
print v;
quit;
This uses the unique-loc method, more or less, with some modifications; there is probably an easier way to get the matches.
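For comparison with the remark above that R would make this easier, here is a minimal R sketch of the same per-group count and fraction using ave(). It is not part of the SAS solutions. The data frame holds only the first two groups of the example data, and the column names follow the question.

```r
# First two firmindex/productindex groups from the example (NA = SAS missing)
prod <- data.frame(
  firmindex    = c(1, 1, 1, 1, 1, 1, 1, 1),
  productindex = c(1, 1, 1, 1, 1, 2, 2, 2),
  PrChange     = c(NA, 0, 0, -0.4, 0, NA, 1, 0.3)
)

# One grouping factor per firmindex/productindex combination
grp <- interaction(prod$firmindex, prod$productindex, drop = TRUE)

# Flags: value is present, and value is present and nonzero
present <- !is.na(prod$PrChange)
nonzero <- present & prod$PrChange != 0

# ave() computes the group sum and repeats it on every row of the group
prod$Number <- ave(as.numeric(nonzero), grp, FUN = sum)
prod$Fract  <- prod$Number / ave(as.numeric(present), grp, FUN = sum)
```

On these rows the first group gets Number 1 and Fract 0.25, and the second gets Number 2 and Fract 1, matching the desired output in the question.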
A SQL in SAS solution - Parfait's is probably the better one overall, but SAS's willingness to remerge makes the SASsy solution a bit simpler.
proc sql;
create table want as
select firmindex, productindex, prchange,
sum (not (prchange in (0,.))) as number,
calculated number / (sum ( not missing(prchange))) as fract
from have
group by firmindex, productindex;
quit;
SAS will do the grouping/counting/etc. and then merge back to the original dataset with no problem, skipping the need for correlated subqueries. NOT standard SQL, but quite common in SAS nonetheless.
Consider proc sql using conditional CASE WHEN correlated subqueries:
proc sql;
create table ProdChangeCount as
SELECT p.firmindex, p.productindex,
(SELECT SUM(CASE WHEN sub.PrChange ^= . AND sub.PrChange ^= 0 THEN 1 ELSE 0 END)
FROM Prod sub
WHERE sub.firmindex = p.firmindex
AND sub.productindex = p.productindex) AS Number,
CALCULATED Number /
(SELECT Count(*)
FROM Prod sub
WHERE sub.PrChange ^= .
AND sub.firmindex = p.firmindex
AND sub.productindex = p.productindex) AS Frac
FROM Prod p;
quit;

R - conditional cumsum using multiple columns

I'm new to Stack Overflow, so I hope I'm posting my question in the right format. I have a test dataset with three columns, where rank is the ranking of a cell, Esvalue is the value of a cell and zoneID is an area identifier (note: in the real dataset I have up to 40,000 zoneIDs).
rank<-seq(0.1,1,0.1)
Esvalue<-seq(10,1)
zoneID<-rep(seq.int(1,2),times=5)
df<-data.frame(rank,Esvalue,zoneID)
rank Esvalue zoneID
0.1 10 1
0.2 9 2
0.3 8 1
0.4 7 2
0.5 6 1
0.6 5 2
0.7 4 1
0.8 3 2
0.9 2 1
1.0 1 2
I want to calculate the following:
% ES value <- For each rank, including all lower ranks, the cumulative % share of the total ES value relative to the ES value of all zones
cumsum(df$Esvalue)/sum(df$Esvalue)
% ES value zone <- For each rank, including all lower ranks, the cumulative % share of the total Esvalue relative to the ESvalue of a zoneID for each zone. I tried this now using mutate and using dplyr. Both so far only give me the cumulative sum, not the share. In the end this will generate a variable for each zoneID
df %>%
mutate(cA=cumsum(ifelse(!is.na(zoneID) & zoneID==1,Esvalue,0))) %>%
mutate(cB=cumsum(ifelse(!is.na(zoneID) & zoneID==2,Esvalue,0)))
These two variables I want to combine by
1) calculating the abs difference between the two for all the zoneIDs
2) for each rank calculate the mean of the absolute difference over all zoneIDs
In the end the final output should look like:
rank Esvalue zoneID mean_abs_diff
0.1 10 1 0.16666667
0.2 9 2 0.01333333
0.3 8 1 0.12000000
0.4 7 2 0.02000000
0.5 6 1 0.08000000
0.6 5 2 0.02000000
0.7 4 1 0.04666667
0.8 3 2 0.01333333
0.9 2 1 0.02000000
1.0 1 2 0.00000000
Now I created the last using some intermediate steps in Excel but my final dataset will be way too big to be handled by Excel. Any advice on how to proceed would be appreciated
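One possible approach, sketched in base R under the reading that the "absolute difference" is between each zone's cumulative share and the overall cumulative share (this reading reproduces the mean_abs_diff column shown above). It assumes zoneID has no missing values. It builds a rows-by-zones matrix, which may need a chunked or sparse variant for 40,000 zones.

```r
rank    <- seq(0.1, 1, 0.1)
Esvalue <- seq(10, 1)
zoneID  <- rep(1:2, times = 5)
df <- data.frame(rank, Esvalue, zoneID)
df <- df[order(df$rank), ]              # cumulative sums follow rank order

# Overall cumulative share of the total Esvalue
overall <- cumsum(df$Esvalue) / sum(df$Esvalue)

# One column per zone: cumulative share of that zone's own total
zones <- sort(unique(df$zoneID))
zone_share <- sapply(zones, function(z) {
  v <- ifelse(df$zoneID == z, df$Esvalue, 0)
  cumsum(v) / sum(v)
})

# Mean over zones of |zone share - overall share| for each rank
# (overall is recycled down each column of the matrix)
df$mean_abs_diff <- rowMeans(abs(zone_share - overall))
```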

Subtracting Values in Previous Rows: Ecological Lifetable Construction

I was hoping I could get some help. I am constructing a life table, not for insurance but for ecology (a cross-sectional sample of a population of any kind of wild fauna), so essentially leaving out variables like smoker/non-smoker, pregnant, gender, health status, etc.:
AgeClass=c(1,2,3,4,5,6)
Sample=c(100,99,87,46,32,19)
for(i in 1:6){
PropSurv=c(Sample/100)
}
LifeTab1=data.frame(cbind(AgeClass,Sample,PropSurv))
Which gave me this:
ID AgeClass Sample PropSurv
1 1 100 1.00
2 2 99 0.99
3 3 87 0.87
4 4 46 0.46
5 5 32 0.32
6 6 19 0.19
I'm now trying to calculate the number that died in each interval (DeathInt) by taking the number that survived in each row and subtracting the number in the row below it (i.e. 100-99, then 99-87, then 87-46, and so on). The result should look like this:
ID AgeClass Sample PropSurv DeathInt
1 1 100 1.00 1
2 2 99 0.99 12
3 3 87 0.87 41
4 4 46 0.46 14
5 5 32 0.32 13
6 6 19 0.19 NA
I found this and this, and I wasn't sure if they answered my question as these guys subtracted values based on groups. I just wanted to subtract values by row.
Also, just as a side note: I did a for() to get the proportion that survived in each age group. I was wondering if there was another way to do it or if that's the proper, easiest way to do it.
Second note: If any R-users out there know of an easier way to do a life-table for ecology, do let me know!
Thanks!
If you have a vector x, that contains numbers, you can calculate the difference by using the diff function.
In your case it would be
LifeTab1$DeathInt <- c(-diff(Sample), NA)
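On the side note about the for() loop: arithmetic on vectors in R is already element-wise, so the loop is not needed. A minimal sketch of the whole table follows. It divides by the first age class, which equals 100 in this example; that choice is an assumption about how the proportion is meant to be scaled.

```r
AgeClass <- 1:6
Sample   <- c(100, 99, 87, 46, 32, 19)

# Vectorised: each count divided by the starting cohort, no loop required
PropSurv <- Sample / Sample[1]

LifeTab1 <- data.frame(AgeClass, Sample, PropSurv)

# Deaths in each interval: this row minus the next row, NA for the last row
LifeTab1$DeathInt <- c(-diff(LifeTab1$Sample), NA)
LifeTab1$DeathInt
# [1]  1 12 41 14 13 NA
```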
