descriptive statistics in table r for multiple variables - r

I am totally new to R and would appreciate anyone taking the time to help me with these probably simple tasks. I'm at a loss with all the resources available and am not sure where to start.
My data looks something like this:
subject sex age nR medL medR meanL meanR pL ageBin
1 0146si 1 67 26 1 1 1.882353 1.5294118 0.5517241 1
2 0162le 1 72 5 2 1 2 1.25 0.6153846 1
3 0323er 1 54 30 2.5 3 2.416667 2.5 0.4915254 0
4 0811ne 0 41 21 2 2 2 1.75 0.5333333 0
5 0825en 1 44 31 2 2 2.588235 1.8235294 0.5866667 0
Though the actual data has many, many more subjects and variables.
The first thing I need to do is compare the 'ageBin' values. 0 = under age 60, 1 = over age 60. I want to compare stats between these two groups. So I guess the first thing I need is the ability to recognize the different ageBin values and make those the two rows.
Then I need to do things like calculate the frequency of the values in the two groups (i.e. how many instances of 1 and 0), the mean of the 'age' variable, the median of the age variable, the number of males (i.e. sex = 1), the mean of meanL, etc. Simple things like that. I just want them all in one table.
So an example of a potential table might be
n nMale mAge
ageBin 0 14 x x
ageBin 1 14 x x
I could easily do this stuff in SPSS or even Excel...I just really want to get started with R. So any resource or advice someone could offer to point me in the right direction would be so, so helpful. Sorry if this sounds unclear...I can try to clarify if necessary.
Thanks in advance, anyone.

Use the plyr package to split up the data structure, apply a function to each piece, and then combine all the results back together.
install.packages("plyr") # install package from CRAN
library(plyr) # load the package into R
# rebuild the sample data from the question as a data.frame
dd <- data.frame(subject = c("0146si", "0162le", "0323er", "0811ne", "0825en"),
                 sex = c(1, 1, 1, 0, 1),
                 age = c(67, 72, 54, 41, 44),
                 nR = c(26, 5, 30, 21, 31),
                 medL = c(1, 2, 2.5, 2, 2),
                 medR = c(1, 1, 3, 2, 2),
                 meanL = c(1.882353, 2, 2.416667, 2, 2.588235),
                 meanR = c(1.5294118, 1.25, 2.5, 1.75, 1.8235294),
                 pL = c(0.5517241, 0.6153846, 0.4915254, 0.5333333, 0.5866667),
                 ageBin = c(1, 1, 0, 0, 0))
Using the ddply function, you can calculate per-group summaries such as the number of subjects in each group, the number of males, and the mean age:
ddply(dd, .(ageBin), summarise, n = length(subject), nMale = sum(sex), mAge = mean(age))
  ageBin n nMale     mAge
1      0 3     2 46.33333
2      1 2     2 69.50000
The following is a very useful resource by Sean Anderson for getting up to speed with the plyr package.
A more comprehensive resource by Hadley Wickham, the package author, can be found here
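If you would rather not install a package, the same split-apply-combine idea can be sketched in base R. This is only a minimal sketch: the helper name summarise_group and the output column names are made up for illustration, and the data frame holds just the columns of the sample data that the summary needs.

```r
# Sample rows from the question (only the columns used below).
dd <- data.frame(
  sex    = c(1, 1, 1, 0, 1),
  age    = c(67, 72, 54, 41, 44),
  meanL  = c(1.882353, 2, 2.416667, 2, 2.588235),
  ageBin = c(1, 1, 0, 0, 0)
)

# Hypothetical helper: one row of summary statistics for one ageBin group.
summarise_group <- function(g) {
  data.frame(
    ageBin = g$ageBin[1],
    n      = nrow(g),          # frequency of this ageBin value
    nMale  = sum(g$sex == 1),  # sex = 1 codes male
    mAge   = mean(g$age),
    medAge = median(g$age),
    mMeanL = mean(g$meanL)
  )
}

# split() makes one data frame per group, lapply() summarises each,
# and rbind() stacks the one-row results into a single table.
stats <- do.call(rbind, lapply(split(dd, dd$ageBin), summarise_group))
stats
```

Each row of stats is one ageBin group, so further statistics only need another entry inside summarise_group.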

Try the by function. If your data frame is named df:
by(data=df, INDICES=df$ageBin, FUN=summary)

Related

Create a new Variable of values of another variable-multilevel regression

I am trying to set up a multilevel analysis (and I am a total newbie).
In this analysis I want to test whether a high value of one predictor (here: senseofhumor, a numeric value recoded into "high", "low", "medium") predicts the (numeric) outcome better than the other (numeric) predictors (senseofhumor, seriousness, friendlyness).
I have a dataset with many people and groups and want to compare the outcome between the groups with regard to the influence of SenseofhumorHIGH.
The code for that might look like this
RandomslopeEC <- lme(criteria(timepoint1) ~ senseofhumor + seriousness + friendlyness , data = DATA, random = ~ SenseofhumorHIGH|group)
For that reason I created values "high" "low" "medium" for my numeric predictor via
library(tidyverse)
DATA <- DATA %>%
mutate(predictorNew = case_when(senseofhumor< quantile(senseofhumor, 0.5) ~ 'low',
senseofhumor > quantile(senseofhumor, 0.75)~'high',
TRUE ~ 'med'))
Now they look like this:
Person  Group  senseofhumor
1       56     low
7       1      high
87      7      low
764     45     high
Now I realize I might need to split the values of this variable into separate variables if I want to test my idea.
Do any of you know how to generate variables so that they look like this?
Person  Group  senseofhumorHIGH  senseofhumorMED  senseofhumorLOW
1       56     0                 0                1
7       1      1                 0                0
87      7      0                 0                1
764     45     1                 0                0
51      3      1                 0                0
362     9      1                 0                0
87      27     0                 0                1
Does this make any sense to you regarding my approach? Or do you have a better idea?
Thanks a lot in advance
Welcome to learning R. You will want to convert these types of variables to "factors," and R will be able to process them accordingly. To do that, use as.factor(variable). Since your categorized column is predictorNew, for you it may be DATA$predictorNew <- as.factor(DATA$predictorNew). If you need to convert multiple columns, you can use:
factor_cols <- c("Var1","Var2","Var3") # list columns you want as factors
DATA[factor_cols] <- lapply(DATA[factor_cols], as.factor)
Since you are new, note that this forum is typically for questions whose answers can't easily be found online. This question is relatively routine, and more details can be found with a quick Google search. While SO is a great place to learn R, you may be penalized by the SO community in the future for routine questions like this. Just trying to help ensure you keep learning!
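For the literal request, separate 0/1 columns for each level, a minimal sketch could look like the following. The toy data frame stands in for the real DATA and its values are made up; only the column names follow the question. Model functions generally build such indicator columns themselves from a factor, so the explicit columns may only be needed where a formula has to name one level directly.

```r
# Toy data standing in for DATA; the values are made up for illustration.
DATA <- data.frame(
  Person       = c(1, 7, 87, 764),
  Group        = c(56, 1, 7, 45),
  predictorNew = c("low", "high", "low", "high")
)

# Fix the level order so that "med" exists even if no row currently has it.
DATA$predictorNew <- factor(DATA$predictorNew, levels = c("low", "med", "high"))

# One indicator column per level: 1 if the row has that level, 0 otherwise.
DATA$senseofhumorHIGH <- as.integer(DATA$predictorNew == "high")
DATA$senseofhumorMED  <- as.integer(DATA$predictorNew == "med")
DATA$senseofhumorLOW  <- as.integer(DATA$predictorNew == "low")

DATA
```

Exactly one of the three indicator columns is 1 in every row, which is a quick sanity check on the recoding.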

Unable to create exactly equal data partitions using createDataPartition in R- getting 1396 and 1398 observations each but need 1397

I am quite familiar with R but have never had this requirement before: I need to create two exactly equal random partitions using createDataPartition in R.
index = createDataPartition(final_ts$SAR,p=0.5, list = F)
final_test_data = final_ts[index,]
final_validation_data = final_ts[-index,]
This code creates two datasets with 1396 and 1398 observations respectively.
I am surprised that p=0.5 doesn't do what it is supposed to do. Does it have something to do with the resulting dataset not having an odd number of observations by default?
Thanks in advance!
It has to do with the number of cases of the response variable (final_ts$SAR in your case).
For example:
y <- rep(c(0,1), 10)
table(y)
y
0 1
10 10
# even number of cases
Now we split:
idx <- caret::createDataPartition(y, p=0.5, list=FALSE) # draw the split once
train <- y[idx]
table(train) # we have 10 obs.
train
0 1
5 5
test <- y[-idx] # the complement of the same split
table(test) # we have 10 obs.
test
0 1
5 5
If we instead build an example with an odd number of cases in each class:
y <- rep(c(0,1), 11)
table(y)
y
0 1
11 11
We have:
idx <- caret::createDataPartition(y, p=0.5, list=FALSE) # draw the split once
train <- y[idx]
table(train) # we have 12 obs.
train
0 1
6 6
test <- y[-idx] # the complement of the same split
table(test) # we have 10 obs.
test
0 1
5 5
More info here.
Here is another thread which explains why the number returned from createDataPartition might seem to be "off" to us but not according to what this function is trying to do.
So, it depends on what you have in final_ts$SAR and the spread of the data.
If it is a categorical value, e.g. T and F, suppose you have 100 observations in total, of which 55 are T and 45 are F. Calling createDataPartition the way your code does will return 51 indices, because:
55*0.5=27.5, 45*0.5=22.5, round each result up, 28+23=51.
You can refer to the thread below, which has a great explanation of this for the case where the values you want to split on are numbers.
R - caret createDataPartition returns more samples than expected
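The arithmetic above, and one way to get two halves of exactly equal size, can be sketched in base R. This is a hedged sketch, not caret's implementation: the first part only reproduces the per-class rounding described in the answer, and the second part swaps in plain sample(), which gives up stratification on SAR. The final_ts built here is a made-up stand-in with 2794 rows (1396 + 1398 from the question).

```r
# Part 1: per-class rounding as described above (55 T and 45 F, p = 0.5).
class_sizes <- c(T = 55, F = 45)
per_class   <- ceiling(class_sizes * 0.5)  # 27.5 -> 28, 22.5 -> 23
total       <- sum(per_class)              # 28 + 23 = 51

# Part 2: an unstratified split into two halves of identical size.
set.seed(1)
final_ts <- data.frame(SAR = sample(c(0, 1), 2794, replace = TRUE))  # hypothetical data

n     <- nrow(final_ts)
index <- sample(seq_len(n), size = n %/% 2)  # exactly half of the row numbers

final_test_data       <- final_ts[index, , drop = FALSE]
final_validation_data <- final_ts[-index, , drop = FALSE]

c(nrow(final_test_data), nrow(final_validation_data))
```

The trade-off is that the class balance of SAR is no longer guaranteed to match between the two halves, which is what createDataPartition is designed to preserve.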

I need help thinking about how to split a data frame to perform operations

I'm new to R and having a difficult time thinking about the right way to approach a problem. I'm used to doing most of my data analysis in Excel, so I think I'm stuck in spreadsheetland. Now I'm getting into data that's too large to handle comfortably in Excel, so I wanted to step into the light and use R. Thanks in advance for any help you have.
So let's use ChickWeight as an example:
> head(ChickWeight)
weight Time Chick Diet
1 42 0 1 1
2 51 2 1 1
3 59 4 1 1
4 64 6 1 1
5 76 8 1 1
6 93 10 1 1
Say I want to be able to split the data frame by both diet and time point such that it would be easy to generate a table of average weights with Time for columns and Diet for rows. Something like:
0 2 4 6 (time)
1
2 <average weights
3 go in here>
4
(diet)
In my head, the easiest way to do this would be to generate a 2d array containing these values so that I can access them like average_weight[<Time>][<Diet>].
I would also like it to be easy to access all of the average weights for a given time or a given diet using something like average_weight[<Time>][]
I've gotten the sense that I'm not thinking about this problem right, because none of the tools I've found seem to point me in the right direction. The closest I've gotten is using split()
chicks_by_time_and_diet <- split(ChickWeight, list(ChickWeight$Time, ChickWeight$Diet))
But this returns a list of length 55, not a two-dimensional array. I've also tried looking into plyr. This sounded like it was exactly what I wanted, but it's unclear to me exactly how to use it towards this end.
Any help is appreciated, thank you!
Bonus:
In reality my data frame has many more factors than ChickWeight, and if it were possible to access all of the factors for a given 'Time' and 'Diet', that would be ideal.
E.g. pretend that ChickWeight has another factor, height. Would it be possible to store both the average height and weight for a given diet at a particular location in the array such that average_weight_and_height[<Time>][<Diet>] returns a list of (weight, height)?
Using dplyr/tidyr
library(dplyr)
library(tidyr)
ChickWeight %>%
group_by(Time, Diet) %>%
summarise(weight=mean(weight)) %>%
spread(Time, weight)
tapply is made just for this:
> with(ChickWeight, tapply(weight, list(Time, Diet), mean))
1 2 3 4
0 41.40000 40.7 40.8 41.0000
2 47.25000 49.4 50.4 51.8000
4 56.47368 59.8 62.2 64.5000
6 66.78947 75.4 77.9 83.9000
8 79.68421 91.7 98.4 105.6000
10 93.05263 108.5 117.1 126.0000
12 108.52632 131.3 144.4 151.4000
14 123.38889 141.9 164.5 161.8000
16 144.64706 164.7 197.4 182.0000
18 158.94118 187.7 233.1 202.9000
20 170.41176 205.6 258.9 233.8889
21 177.75000 214.7 270.3 238.5556
You can also use data.table or dplyr, though you will need to reshape the results of those to get to the 2D (or 3D) formats:
library(data.table)
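# long format: one row per Time/Diet combination, with a named mean column
DT <- as.data.table(ChickWeight)[, .(weight = mean(weight)), by = .(Time, Diet)]
# reshape to wide: Time as rows, one column per Diet
dcast(DT, Time ~ Diet, value.var = "weight")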
Or, as Arun points out (here we just use a normal data frame):
reshape2::dcast(ChickWeight, Time ~ Diet, value.var="weight", fun.aggregate=mean)
A lot of R analysis involves getting comfortable with data in "long format" (see DT before we dcast it), where dimensions are represented by columns.
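To connect this back to the average_weight[<Time>][<Diet>] idea in the question: the tapply result is an ordinary matrix with row and column names, so it can be indexed by name. The sketch below also touches the bonus question using aggregate; the height column is hypothetical and derived from weight purely so the example runs, since ChickWeight has no such variable.

```r
# 2D table of mean weights: rows are Time, columns are Diet.
avg <- with(ChickWeight, tapply(weight, list(Time = Time, Diet = Diet), mean))

avg["10", "2"]  # one cell: Time 10, Diet 2
avg["10", ]     # every diet at Time 10
avg[, "2"]      # every time point for Diet 2

# Bonus: several variables at once, kept in long format.
cw <- as.data.frame(ChickWeight)
cw$height <- cw$weight / 10  # hypothetical column, for illustration only
both <- aggregate(cbind(weight, height) ~ Time + Diet, data = cw, FUN = mean)

# All averaged variables for one Time/Diet combination.
subset(both, Time == 10 & Diet == 2)
```

Note that the names are character strings ("10", "2"); indexing with the bare number 10 would pick the tenth row rather than Time 10.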

How to calculate differences in column values based on value markers (ex.1 or 2) in a different column of the same .csv file in R?

I have a .csv file with several columns, but I am only interested in two of them (TIME and USER). The USER column consists of the value markers 1 or 2 in chunks, and the TIME column consists of a value in seconds. I want to calculate the difference between the TIME value of the first 2 in a chunk in the USER column and the first 1 in a chunk in the USER column. I want to accomplish this through R. It would be ideal for there to be another column added to my data file with these differences.
So far I have only imported the .csv into R.
Latency <- read.csv("/Users/alinazjoo/Documents/Latency_allgaze.csv")
I'm going to guess your data looks like this
# sample data
set.seed(15)
rr<-sample(1:4, 10, replace=T)
dd<-data.frame(
user=rep(1:5, each=10),
marker=rep(rep(1:2,10), c(rbind(rr, 5-rr))),
time=1:50
)
Then you can calculate the difference using the base functions aggregate and transform:
namin<-function(...) min(..., na.rm=T)
dx<-transform(aggregate(
cbind(m2=ifelse(marker==2,time,NA), m1=ifelse(marker==1, time,NA)) ~ user,
dd, namin, na.action=na.pass),
diff = m2-m1)
dx
# user m2 m1 diff
# 1 1 4 1 3
# 2 2 15 11 4
# 3 3 23 21 2
# 4 4 35 31 4
# 5 5 44 41 3
We use aggregate to find the minimal time for each of the two kinds of markers, then we use transform to calculate the difference between them.
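If the file really has only TIME and USER columns and the chunks are simply consecutive runs of 1s and 2s, a different reading of the question is to work run by run. The following is a minimal sketch under that assumption; the toy values are made up, and the column name diff is just a choice for illustration.

```r
# Hypothetical toy data; the real file's layout is an assumption.
Latency <- data.frame(
  TIME = c(0.0, 0.5, 1.2, 2.0, 2.4, 3.1, 4.0, 4.6),
  USER = c(1,   1,   2,   2,   1,   2,   2,   1)
)

# TRUE on the first row of every run (chunk) of identical USER values.
is_start <- c(TRUE, diff(Latency$USER) != 0)
starts   <- Latency[is_start, ]

# USER only takes the values 1 and 2, so runs alternate: the chunk start
# just before a 2-chunk is always the start of a 1-chunk.
prev_time   <- c(NA, head(starts$TIME, -1))
starts$diff <- ifelse(starts$USER == 2, starts$TIME - prev_time, NA)

# Write the differences back as a new column, filled in only on the
# first row of each 2-chunk.
Latency$diff <- NA_real_
Latency$diff[is_start] <- starts$diff
Latency
```

If the data starts with a chunk of 2s, that first chunk has no preceding 1 and its difference stays NA.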

Unit of Analysis Conversion

We are working on a social capital project so our data set has a list of an individual's organizational memberships. So each person gets a numeric ID and then a sub ID for each group they are in. The unit of analysis, therefore, is the group they are in. One of our variables is a three point scale for the type of group it is. Sounds simple enough?
We want to bring the unit of analysis to the individual level and condense the type of group it is into a variable signifying how many different types of groups they are in.
For instance, person one is in eight groups. Of those groups, three are (1s), three are (2s), and two are (3s). What the individual level variable would look like, ideally, is 3, because she is in all three types of groups.
Is this possible in the least?
##simulate data
##individuals
n <- 10
## groups
g <- 5
## group types
gt <- 3
## individuals*group membership
N <- 20
## individuals data frame
di <- data.frame(individual=sample(1:n,N,replace=TRUE),
group=sample(1:g,N, replace=TRUE))
## groups data frame
dg <- data.frame(group=1:g, type=sample(1:gt,g,replace=TRUE))
## merge
dm <- merge(di,dg)
## order - not necessary, but nice
dm <- dm[order(dm$individual),]
## group type per individual
library(plyr)
dr <- ddply(dm, "individual", function(x) length(unique(x$type)))
> head(dm)
group individual type
2 2 1 2
8 2 1 2
20 5 1 1
9 3 3 2
12 3 3 2
17 4 3 2
> head(dr)
individual V1
1 1 2
2 3 1
3 4 2
4 5 1
5 6 1
6 7 1
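The same count can be sketched without plyr using base R's aggregate. The memberships data frame below is hypothetical and mirrors the example from the question: person 1 is in eight groups, three of type 1, three of type 2 and two of type 3.

```r
# Hypothetical membership data: one row per individual-group membership.
memberships <- data.frame(
  individual = c(rep(1, 8), rep(2, 3)),
  type       = c(1, 1, 1, 2, 2, 2, 3, 3,   2, 2, 2)
)

# For each individual, count how many distinct group types appear.
n_types <- aggregate(type ~ individual, data = memberships,
                     FUN = function(x) length(unique(x)))
names(n_types)[2] <- "n_types"

n_types
```

Person 1 gets 3 because she is in all three types of groups, while person 2, whose groups are all of type 2, gets 1.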
I think what you're asking is whether it is possible to count the number of unique types of group to which an individual belongs.
If so, then that is certainly possible.
I wouldn't be able to tell you how to do it in R since I don't know a lot of R, and I don't know what your data looks like. But there's no reason why it wouldn't be possible.
Is this data coming from a database? If so, then it might be easier to write a SQL query to compute the value you want, rather than to do it in R. If you describe your schema, there should be lots of people here who could give you the query you need.
