Subsetting a data frame - Confused about syntax - r

Say I have the following data frame:
LungCap Age Height Smoke Gender Caesarean
1 6.475 6 62.1 no male no
2 10.125 18 74.7 yes female no
3 9.550 16 69.7 no female yes
4 11.125 14 71.0 no male no
5 4.800 5 56.9 no male no
6 6.225 11 58.7 no female no
Now I want to select all rows where the age is > 11 and gender is female. This gets me what I want:
y[y$Age>11&y$Gender=="female",]
LungCap Age Height Smoke Gender Caesarean
2 10.125 18 74.7 yes female no
3 9.550 16 69.7 no female yes
But this does not:
y[y$Age>11&y$Gender=="female"]
Age Height
1 6 62.1
2 18 74.7
3 16 69.7
4 14 71.0
5 5 56.9
6 11 58.7
I'm very new at R and I don't understand what this second query is doing, other than it's not giving me what I want.

When you subset the data frame with the first syntax, the first vector in the square brackets (numeric or logical) selects rows, while the second (after the comma) selects columns.
If you leave the position after the comma empty, R returns all the columns.
If you omit the comma altogether, R treats the data frame as a list of columns, so the single index selects columns.
In your case y$Age>11&y$Gender=="female" is a logical vector that is TRUE at positions 2 and 3. Without the comma, R therefore selects columns 2 and 3, which is why you get Age and Height.
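The difference can be reproduced with a small sketch. The data frame below is rebuilt from the values in the question; the names keep, rows and cols are only illustrative.

```r
# Data frame rebuilt from the question
y <- data.frame(
  LungCap   = c(6.475, 10.125, 9.550, 11.125, 4.800, 6.225),
  Age       = c(6, 18, 16, 14, 5, 11),
  Height    = c(62.1, 74.7, 69.7, 71.0, 56.9, 58.7),
  Smoke     = c("no", "yes", "no", "no", "no", "no"),
  Gender    = c("male", "female", "female", "male", "male", "female"),
  Caesarean = c("no", "no", "yes", "no", "no", "no")
)

# Logical vector with one element per row
keep <- y$Age > 11 & y$Gender == "female"
keep                # FALSE TRUE TRUE FALSE FALSE FALSE

rows <- y[keep, ]   # with the comma: filters rows, keeps all columns
cols <- y[keep]     # without the comma: the same vector is used to pick columns

rownames(rows)      # "2" "3"
names(cols)         # "Age" "Height"
```

Note that the second form only runs quietly here because the data frame happens to have six rows and six columns, so the six-element logical vector lines up with the columns.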

Related

How to find row means of a data frame that includes categorical variables? Summing up numeric data of a category group [duplicate]

I have data on penicillin production with four treatments (A, B, C, D), which should be the columns, and five blocks, which should be the rows.
I need to calculate the sum and mean of each row separately. The data frame stores each variable in its own column, so I cannot pick out, say, treatment A and sum it. How can I arrange the data so that each row holds four numbers whose mean and sum I can calculate?
Here is my code:
pencilline=c(89,88,97,94,84,77,92,79,81,87,87,85,87,92,89,84,79,81,80,88)
treatment=factor(rep(LETTERS[1:4],times=5))
block=sort(rep(1:5,times=4))
datap=data.frame(pencilline,block,treatment)
datap
datap_subset=unlist(lapply(datap,is.numeric))
datap_subset
pencilline block treatment
TRUE TRUE FALSE
rowMeans(datap[,datap_subset])
[1] 45.0 44.5 49.0 47.5 43.0 39.5 47.0 40.5 42.0 45.0 45.0 44.0 45.5 48.0 46.5 44.0 42.0 43.0 42.5 46.5
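which gives the wrong row means: it averages pencilline and block within each of the 20 rows instead of averaging the four treatments in each block.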
Do you want this?
library(dplyr)
datap %>% group_by(block) %>%
summarise(mean = mean(pencilline))
# A tibble: 5 x 2
block mean
<int> <dbl>
1 1 92
2 2 83
3 3 85
4 4 88
5 5 82
Its base R equivalent:
aggregate(pencilline ~ block, datap, mean)
block pencilline
1 1 92
2 2 83
3 3 85
4 4 88
5 5 82
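If the block-by-treatment layout described in the question is what is wanted, one possible sketch is to reshape the values into a matrix first and then use rowSums and rowMeans. This uses tapply from base R; the name wide is an assumption for illustration, and sum is applied only because each block/treatment cell holds exactly one observation.

```r
# Data from the question
pencilline <- c(89, 88, 97, 94, 84, 77, 92, 79, 81, 87,
                87, 85, 87, 92, 89, 84, 79, 81, 80, 88)
treatment <- factor(rep(LETTERS[1:4], times = 5))
block <- sort(rep(1:5, times = 4))
datap <- data.frame(pencilline, block, treatment)

# One row per block, one column per treatment
wide <- with(datap,
             tapply(pencilline, list(block = block, treatment = treatment), sum))
wide

rowSums(wide)    # sum per block:  368 332 340 352 328
rowMeans(wide)   # mean per block: 92 83 85 88 82
colMeans(wide)   # mean per treatment
```

The block means agree with the group_by and aggregate results above.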

how to fit curve with two exponential terms in R

the formula is
f=exp(-d*t)+exp(g*t)-1
My dataset contains many observations (f in the formula) at several different times (t) for several subjects, and I want to estimate d and g.
How should I code this in R? I also don't know how to choose starting values, since each subject may have a differently shaped curve.
Here are some hypothetical examples:
subject t f
1 1 0 515.6
2 1 70 62.9
3 1 126 34.8
4 1 181 18.5
5 1 245 28.9
6 1 289 29.6
7 1 359 109.1
8 1 408 33.2
9 1 531 16.9
10 1 569 97.2
I have hundreds of subjects, and I want to estimate the parameters (d and g) at the individual level, that is, a separate curve for each subject.
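One possible approach, sketched here under stated assumptions, is to fit the model separately for each subject with nls from base R. The data below are simulated from the formula with known d and g (the hypothetical values in the question are not used, since the formula always gives f = 1 at t = 0). The starting values and the helper names model, truth and fit_one are illustrative guesses, not a recommendation for real data.

```r
# Model from the question: f = exp(-d*t) + exp(g*t) - 1
model <- function(t, d, g) exp(-d * t) + exp(g * t) - 1

# Simulated data (assumption): two subjects with known d and g plus small noise
set.seed(1)
truth <- data.frame(subject = 1:2, d = c(0.05, 0.08), g = c(0.010, 0.015))
dat <- do.call(rbind, lapply(seq_len(nrow(truth)), function(i) {
  times <- seq(0, 100, by = 10)
  data.frame(subject = truth$subject[i],
             t = times,
             f = model(times, truth$d[i], truth$g[i]) +
                 rnorm(length(times), sd = 0.01))
}))

# Fit one nls() model per subject; return NA if a fit fails to converge
fit_one <- function(df) {
  tryCatch(
    coef(nls(f ~ exp(-d * t) + exp(g * t) - 1, data = df,
             start = list(d = 0.03, g = 0.005))),
    error = function(e) c(d = NA_real_, g = NA_real_)
  )
}

# One row per subject, columns d and g
est <- t(sapply(split(dat, dat$subject), fit_one))
est
```

With real data a single set of starting values may not suit every subject, so the tryCatch wrapper keeps the loop running and marks failed fits as NA for a second pass with different starts.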

Group function by two variables on data.table

My data looks something like this
students<-data.table(studid=c(1:6) ,FACULTY= c("IT","SCIENCE", "LAW","IT","IT","IT"),
SEX=c("Male","Male","Male","Female","Female","Male"), WAM=c(65,35,98,55,20,80))
studid FACULTY SEX AVE_MARK (WAM)
1 IT Male 65
2 SCIENCE Male 35
3 LAW Male 98
4 IT Female 55
5 IT Female 20
6 IT Male 80
I have used the following code to calculate the averages
students[, mean(WAM, na.rm=TRUE), by=FACULTY][order(-V1)]
So my headings are
FACULTY V1
IT 65
LAW 50
etc.
I would also like to break this down by sex:
FACULTY V1 V1
Male Female
IT 65 11
LAW 50 11
Any advice on how to do this would be greatly appreciated.
You could try
dcast.data.table(students, FACULTY~SEX, fun.aggregate=mean, na.rm=TRUE,
value.var='WAM')
# FACULTY Female Male
#1: IT 37.5 72.5
#2: LAW NaN 98.0
#3: SCIENCE NaN 35.0
Do you definitely need it in cross-tabular format? If so, the dcast answer above is the way to go.
Otherwise, here they are stacked:
students[, mean(WAM, na.rm=TRUE), by=c('FACULTY','SEX')]
FACULTY SEX V1
1: IT Male 72.5
2: SCIENCE Male 35.0
3: LAW Male 98.0
4: IT Female 37.5
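The two answers can be combined in a short sketch that also names the result column instead of leaving the default V1. It assumes the data.table package is installed; mean_WAM, long and wide are illustrative names.

```r
library(data.table)

# Data from the question
students <- data.table(
  studid  = 1:6,
  FACULTY = c("IT", "SCIENCE", "LAW", "IT", "IT", "IT"),
  SEX     = c("Male", "Male", "Male", "Female", "Female", "Male"),
  WAM     = c(65, 35, 98, 55, 20, 80)
)

# Stacked (long) result with a named column
long <- students[, .(mean_WAM = mean(WAM, na.rm = TRUE)), by = .(FACULTY, SEX)]
long

# Cross-tabulated (wide) result: one column per sex, NA where a group is empty
wide <- dcast(long, FACULTY ~ SEX, value.var = "mean_WAM")
wide
```

Because the means are computed before reshaping, empty FACULTY/SEX combinations show up as NA rather than NaN.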

How to input a 3-way table?

I have the data in table form (not even an R table) and I want to get it into R to perform analysis.
The table is a 3-way contingency table of counts by age, breathlessness, and coughing.
Is there a way to easily input this into R? (It can be in any format as long as I can perform some regression analysis.)
Or do I need to enter it manually?
In R, this is an ftable.
Inputting an ftable manually is not too difficult if you know how the function works. The data need to be in a format like this:
breathless yes no
coughed yes no
age
20-24 9 7 95 1841
25-29 23 9 108 1654
30-34 54 19 177 1863
If the data are in this format, you can use read.ftable. For example:
temp <- read.ftable(textConnection("breathless yes no
coughed yes no
age
20-24 9 7 95 1841
25-29 23 9 108 1654
30-34 54 19 177 1863"))
temp
#       breathless yes     no
#       coughed    yes no yes   no
# age
# 20-24              9  7  95 1841
# 25-29             23  9 108 1654
# 30-34             54 19 177 1863
From there, if you want a "long" data.frame, with which analysis and reshaping to different formats is much easier, just wrap it in data.frame().
data.frame(temp)
# age breathless coughed Freq
# 1 20-24 yes yes 9
# 2 25-29 yes yes 23
# 3 30-34 yes yes 54
# 4 20-24 no yes 95
# 5 25-29 no yes 108
# 6 30-34 no yes 177
# 7 20-24 yes no 7
# 8 25-29 yes no 9
# 9 30-34 yes no 19
# 10 20-24 no no 1841
# 11 25-29 no no 1654
# 12 30-34 no no 1863
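If typing the text layout for read.ftable feels fragile, the same long data frame can also be entered directly. This is only an alternative sketch using expand.grid and xtabs from base R; the counts are copied from the table above, and the names long and tab are illustrative.

```r
# Factor combinations: age varies fastest, then breathless, then coughed
long <- expand.grid(
  age        = c("20-24", "25-29", "30-34"),
  breathless = c("yes", "no"),
  coughed    = c("yes", "no")
)

# Counts in the same order as the rows of `long`
long$Freq <- c(9, 23, 54,          # breathless yes, coughed yes
               95, 108, 177,       # breathless no,  coughed yes
               7, 9, 19,           # breathless yes, coughed no
               1841, 1654, 1863)   # breathless no,  coughed no

# Rebuild the 3-way contingency table and print it flat
tab <- xtabs(Freq ~ age + breathless + coughed, data = long)
ftable(tab, row.vars = "age")
```

The long form with a Freq column is the shape most modelling functions expect, for example as weights or as a count response.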

combine two pieces of data in R

I'm not sure how best to describe this, so I'll just show you. I have two variables.
A:
ID
1 121
2 122
3 123
4 124
5 125
6 126
7 127
8 128
9 129
and B:
var1 var2 var3
1 57.1 116.5 73.0
2 38.1 15.8 22.7
3 84.2 99.2 72.2
and I would like them to end up like this:
ID
1 121 57.1
2 122 116.5
3 123 73.0
4 124 38.1
5 125 15.8
6 126 22.7
7 127 84.2
8 128 99.2
9 129 72.2
Does that make sense? I'd like to keep the original variable and add a column made of the rows of the other variable, in order. Preferably I'd like the result as a data frame.
Thanks in advance.
Data frames and matrices are filled by column by default. Since you want a numeric vector filled by row, you need to transpose the data.frame before coercing it to a vector, so that the values come out in the order you want.
A$value <- c(t(B))
Transposing a data.frame gives a matrix, which c() then flattens into a numeric vector.
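Assuming B is a data.frame, you can do:
cbind(A, var.name = as.vector(t(as.matrix(B))))
The t() is needed because as.vector on a matrix reads down the columns, which would give 57.1, 38.1, 84.2, ... instead of the row order you asked for. Replace var.name with whatever column name you want.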
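A quick check with the data from the question shows why the transpose matters. A and B are rebuilt from the values shown above; by_col is an illustrative name.

```r
# Data from the question
A <- data.frame(ID = 121:129)
B <- data.frame(var1 = c(57.1, 38.1, 84.2),
                var2 = c(116.5, 15.8, 99.2),
                var3 = c(73.0, 22.7, 72.2))

# Row-by-row order: transpose first, then flatten
A$value <- c(t(B))
A

# Without the transpose the values come out column by column
by_col <- as.vector(as.matrix(B))
by_col[1:3]   # 57.1 38.1 84.2
```

Both forms need the number of cells in B (here 9) to equal the number of rows in A; otherwise R recycles the values or stops with an error.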
