TABLE of age groups - r

I have different groups (2 in my example, 85 in my real data) and would like to produce a table of age classes (0-10, 11-20, 21-30, 31-40 etc.) for each group:
group age
1 1 34
2 1 37
3 1 22
4 1 10
5 1 11
6 1 12
7 1 14
8 2 56
9 2 46
10 2 25
11 2 24
12 2 13
13 2 13
14 2 45
15 2 45
16 2 23
17 2 56
18 2 54
19 2 31
20 2 68
I have tried various solutions from the forum:
mydf$ageclass<-cut(mydf$age, seq(0,100,10))
only works for the entire df and has no possibility of grouping.
mydf$ageclass<-Freq(mydf$age, breaks=c(0,20,30,40,50,60,70,80))
also only returns a solution for the entire dataframe
I have no way of integrating the "group" into these functions.
Also, both return a column with the age class given as '(30,40]' (meaning upper and lower class bound) and I would like the result to be a table like this:
group 0-10 11-20 21-30 31-40
1
2
What am I missing? Perhaps a for loop? I am new to base R and would really enjoy some pointers on how to think about the problem.

Is this what you are trying to achieve?
mydf$ageclass <- with(mydf, cut(age, seq(0,100,10)))
with(mydf, table(group, ageclass))
ageclass
group (0,10] (10,20] (20,30] (30,40] (40,50] (50,60] (60,70] (70,80] (80,90] (90,100]
1 1 3 1 1 0 0 0 0 0 0
2 0 2 3 1 3 3 1 0 0 0
Edit
cut() also has a labels argument:
mydf$ageclass <- with(mydf, cut(age, seq(0,100,10), labels = paste0(seq(0,90,10) + 1, "-", seq(0,90,10) + 10)))
with(mydf, table(group, ageclass))
ageclass
group 1-10 11-20 21-30 31-40 41-50 51-60 61-70 71-80 81-90 91-100
1 1 3 1 1 0 0 0 0 0 0
2 0 2 3 1 3 3 1 0 0 0
Data
mydf <- structure(list(group = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), age = c(37L, 22L, 10L,
11L, 12L, 14L, 56L, 46L, 25L, 24L, 13L, 13L, 45L, 45L, 23L, 56L,
54L, 31L, 68L)), row.names = c(NA, -19L), class = "data.frame")
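The asker wanted the first class to be 0-10, but cut() with its default arguments leaves an age of exactly 0 outside (0,10]. A minimal base R sketch of one way to handle that, assuming the same sample data as the answer above; the names breaks, lower, class_labels and tab are introduced here for illustration only:

```r
# Same sample data as in the answer above (19 rows).
mydf <- data.frame(
  group = rep(c(1L, 2L), times = c(6L, 13L)),
  age   = c(37L, 22L, 10L, 11L, 12L, 14L,
            56L, 46L, 25L, 24L, 13L, 13L, 45L, 45L, 23L, 56L, 54L, 31L, 68L)
)

breaks <- seq(0, 100, 10)

# Build labels "0-10", "11-20", ..., "91-100" from the break points.
lower        <- c(0, breaks[-c(1, length(breaks))] + 1)
class_labels <- paste0(lower, "-", breaks[-1])

# include.lowest = TRUE closes the first interval on the left,
# so an age of 0 falls into "0-10" instead of becoming NA.
mydf$ageclass <- cut(mydf$age, breaks = breaks,
                     labels = class_labels, include.lowest = TRUE)

# One row per group, one column per age class.
tab <- table(mydf$group, mydf$ageclass, dnn = c("group", "ageclass"))
tab
```

Because table() is given both columns, the same call should scale to 85 groups without a loop.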

Related

Calculate the most freq repeated difference between rows for each group in a df in r

Given a data frame like below:
Name No Diff Most repeated Diff
A 24
A 35
A 39
A 41
A 42
A 43
B 32
B 35
B 36
B 37
C 34
C 40
C 42
D 34
D 39
D 44
E 35
E 36
How can I calculate the last column as the most frequently repeated difference between rows? (That is, for each group I want to calculate the differences between consecutive rows and then see which difference occurs most often; in this case A would be 1, since two of its differences equal 1.)
Thanks in advance.
We can use diff to calculate difference and table to count their frequency
library(dplyr)
df %>%
group_by(Name) %>%
mutate(diff = c(NA, diff(No)),
#Can also use lag to get difference with previous value
#diff = No - lag(No),
most_repeated_diff = names(which.max(table(diff))))
# Name No diff most_repeated_diff
# <fct> <int> <int> <chr>
# 1 A 24 NA 1
# 2 A 35 11 1
# 3 A 39 4 1
# 4 A 41 2 1
# 5 A 42 1 1
# 6 A 43 1 1
# 7 B 32 NA 1
# 8 B 35 3 1
# 9 B 36 1 1
#10 B 37 1 1
#11 C 34 NA 2
#12 C 40 6 2
#13 C 42 2 2
#14 D 34 NA 5
#15 D 39 5 5
#16 D 44 5 5
#17 E 35 NA 1
#18 E 36 1 1
data
df <- structure(list(Name = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L, 5L, 5L), .Label = c("A",
"B", "C", "D", "E"), class = "factor"), No = c(24L, 35L, 39L,
41L, 42L, 43L, 32L, 35L, 36L, 37L, 34L, 40L, 42L, 34L, 39L, 44L,
35L, 36L)), class = "data.frame", row.names = c(NA, -18L))
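For comparison, a base R sketch of the same idea without dplyr, assuming the data shown above. The helper most_common_diff is a hypothetical name introduced here, not part of any package. On ties, which.max() picks the first entry of the sorted table, i.e. the smallest difference:

```r
# Sample data from the question (Name kept as character for simplicity).
df <- data.frame(
  Name = rep(c("A", "B", "C", "D", "E"), times = c(6, 4, 3, 3, 2)),
  No   = c(24L, 35L, 39L, 41L, 42L, 43L, 32L, 35L, 36L, 37L,
           34L, 40L, 42L, 34L, 39L, 44L, 35L, 36L)
)

# Hypothetical helper: most frequent consecutive difference in a vector.
most_common_diff <- function(x) {
  d <- diff(x)
  if (length(d) == 0) return(NA_integer_)  # a single-row group has no differences
  counts <- table(d)
  as.integer(names(counts)[which.max(counts)])
}

# ave() applies the helper per Name and recycles the result over the group's rows.
df$most_repeated_diff <- ave(df$No, df$Name, FUN = most_common_diff)
df
```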

Calculating within group differences R

I’m trying to figure out how to append a column that identifies whether a difference of 10 exists between different IDs for a given day using the column named reading.
**Day ID Reading**
19-Jan 1 10
19-Jan 1 10
19-Jan 1 10
19-Jan 1 20
19-Jan 2 20
19-Jan 2 20
19-Jan 2 20
19-Jan 2 20
20-Jan 1 10
21-Jan 1 10
22-Jan 1 10
23-Jan 1 10
24-Jan 1 20
25-Jan 2 20
25-Jan 2 20
25-Jan 2 20
25-Jan 2 10
I would like:
**Day ID Reading Difference**
19-Jan 1 10 Y
19-Jan 1 10 Y
19-Jan 1 10 Y
19-Jan 1 20 Y
19-Jan 2 20 N
19-Jan 2 20 N
19-Jan 2 20 N
19-Jan 2 20 N
20-Jan 1 10 N
21-Jan 1 10 N
22-Jan 1 10 N
23-Jan 1 10 N
24-Jan 1 20 N
25-Jan 2 20 Y
25-Jan 2 20 Y
25-Jan 2 20 Y
25-Jan 2 10 Y
What you could do is to check whether the difference of the range is equal to or greater than 10 for each group.
dat$Diff <- with(dat, ave(Reading, Day, ID, FUN = function(x) diff(range(x)) >= 10))
dat
# Day ID Reading Diff
#1 19-Jan 1 10 1
#2 19-Jan 1 10 1
#3 19-Jan 1 10 1
#4 19-Jan 1 20 1
#5 19-Jan 2 20 0
#6 19-Jan 2 20 0
#7 19-Jan 2 20 0
#8 19-Jan 2 20 0
#9 20-Jan 1 10 0
#10 21-Jan 1 10 0
#11 22-Jan 1 10 0
#12 23-Jan 1 10 0
#13 24-Jan 1 20 0
#14 25-Jan 2 20 1
#15 25-Jan 2 20 1
#16 25-Jan 2 20 1
#17 25-Jan 2 10 1
data
dat <- structure(list(Day = c("19-Jan", "19-Jan", "19-Jan", "19-Jan",
"19-Jan", "19-Jan", "19-Jan", "19-Jan", "20-Jan", "21-Jan", "22-Jan",
"23-Jan", "24-Jan", "25-Jan", "25-Jan", "25-Jan", "25-Jan"),
ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L), Reading = c(10L, 10L, 10L, 20L, 20L, 20L,
20L, 20L, 10L, 10L, 10L, 10L, 20L, 20L, 20L, 20L, 10L)), .Names = c("Day",
"ID", "Reading"), class = "data.frame", row.names = c(NA, -17L
))
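The question asked for Y/N rather than 1/0. A small sketch of that last step, assuming the same dat as above; flag is a name introduced here for illustration. Grouping on a pasted Day/ID key is a choice made in this sketch so that ave() only sees combinations that actually occur in the data:

```r
dat <- data.frame(
  Day = c(rep("19-Jan", 8), "20-Jan", "21-Jan", "22-Jan", "23-Jan", "24-Jan",
          rep("25-Jan", 4)),
  ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  Reading = c(10L, 10L, 10L, 20L, 20L, 20L, 20L, 20L, 10L, 10L, 10L, 10L, 20L,
              20L, 20L, 20L, 10L)
)

# 1 if the readings within a Day/ID group span at least 10, otherwise 0.
flag <- ave(dat$Reading, paste(dat$Day, dat$ID),
            FUN = function(x) diff(range(x)) >= 10)

# Translate the 0/1 indicator into the Y/N labels from the question.
dat$Difference <- ifelse(flag == 1, "Y", "N")
dat
```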
We can use data.table
library(data.table)
setDT(df1)[, Difference := abs(Reduce(`-`, as.list(range(Reading)))) >= 10,
.(ID, Day)]
df1
# Day ID Reading Difference
# 1: 19-Jan 1 10 TRUE
# 2: 19-Jan 1 10 TRUE
# 3: 19-Jan 1 10 TRUE
# 4: 19-Jan 1 20 TRUE
# 5: 19-Jan 2 20 FALSE
# 6: 19-Jan 2 20 FALSE
# 7: 19-Jan 2 20 FALSE
# 8: 19-Jan 2 20 FALSE
# 9: 20-Jan 1 10 FALSE
#10: 21-Jan 1 10 FALSE
#11: 22-Jan 1 10 FALSE
#12: 23-Jan 1 10 FALSE
#13: 24-Jan 1 20 FALSE
#14: 25-Jan 2 20 TRUE
#15: 25-Jan 2 20 TRUE
#16: 25-Jan 2 20 TRUE
#17: 25-Jan 2 10 TRUE
data
df1 <- structure(list(Day = c("19-Jan", "19-Jan", "19-Jan", "19-Jan",
"19-Jan", "19-Jan", "19-Jan", "19-Jan", "20-Jan", "21-Jan", "22-Jan",
"23-Jan", "24-Jan", "25-Jan", "25-Jan", "25-Jan", "25-Jan"),
ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L), Reading = c(10L, 10L, 10L, 20L, 20L, 20L,
20L, 20L, 10L, 10L, 10L, 10L, 20L, 20L, 20L, 20L, 10L)),
class = "data.frame", row.names = c(NA, -17L))
Using tidyverse you could do something like
library(tidyverse)
your_data %>%
group_by(Day, ID) %>%
# TRUE when the Reading values within a Day/ID group span at least 10
mutate(difference = (max(Reading) - min(Reading)) >= 10)

R Convert dataset to a column of a second (matching) dataset

I have two datasets in R, and I am trying to add values of the first dataset into a single column of the second dataset. The two datasets have matching variables, based on these variables the new column should be constructed.
The first dataset looks like this:
Experiment Subject R1 R2 R3 R4
1 1 28 29 59 55
1 3 27 24 50 50
1 5 30 30 61 50
1 7 26 30 60 60
1 10 30 30 65 65
2 2 34 34 61 61
2 4 25 25 49 48
2 8 26 26 55 48
2 9 20 20 60 60
The second dataset looks like this:
Subject Experiment R NewColumn
1 1 3
1 1 3
1 1 3
1 1 3
1 1 3
1 1 4
1 1 4
1 1 4
1 1 4
1 1 4
1 1 1
1 1 1
1 1 1
1 1 1
1 1 1
1 1 2
1 1 2
1 1 2
1 1 2
1 1 2
2 2 4
2 2 4
2 2 4
2 2 4
2 2 4
2 2 3
2 2 3
2 2 3
2 2 3
2 2 3
So, basically I am trying to create a script or use a function that copies the values of R1-R4 of the first dataset into the 'NewColumn' of the second dataset, given that Experiment, Subject, and R (1-4) match.
I have tried to create a solution using loops and if statements, but unfortunately without success.
Edit:
I think I should add that the second dataset contains (many) more variables (columns, which I left out for this example), is quite long (about 2000 rows) and is not ordered (Experiment, Subject and 'R' don't follow a logical order).
So my thought is, that the script should 'read' the variables 'Experiment' 'Subject' and 'R' from the second dataset, and paste the corresponding value from the first dataset (e.g. Experiment 1, Subject 1, R3) into the 'NewColumn' column. Many thanks for all of your input so far!
Any advice on how to solve this is very much appreciated.
We could use gather from tidyr to reshape the first dataset ('df1') from 'wide' to 'long' format. We create key/val columns ('Var', 'NewCol') from the R1:R4 columns. Then we split the 'Var' column into two new columns ('V1', 'R') using extract, left_join with 'df2' by specifying the common columns, and select the columns that are needed in the output.
library(dplyr)
library(tidyr)
gather(df1, Var, NewCol, R1:R4) %>%
extract(Var, into=c('V1', 'R'), '(.)(.)', convert=TRUE) %>%
left_join(df2, ., by=c('Subject', 'Experiment', 'R')) %>%
select(-V1)
# Subject Experiment R NewCol
#1 1 1 3 59
#2 1 1 3 59
#3 1 1 3 59
#4 1 1 3 59
#5 1 1 3 59
#6 1 1 4 55
#7 1 1 4 55
#8 1 1 4 55
#9 1 1 4 55
#10 1 1 4 55
#11 1 1 1 28
#12 1 1 1 28
#13 1 1 1 28
#14 1 1 1 28
#15 1 1 1 28
#16 1 1 2 29
#17 1 1 2 29
#18 1 1 2 29
#19 1 1 2 29
#20 1 1 2 29
#21 2 2 4 61
#22 2 2 4 61
#23 2 2 4 61
#24 2 2 4 61
#25 2 2 4 61
#26 2 2 3 61
#27 2 2 3 61
#28 2 2 3 61
#29 2 2 3 61
#30 2 2 3 61
data
df1 <- structure(list(Experiment = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L), Subject = c(1L, 3L, 5L, 7L, 10L, 2L, 4L, 8L, 9L), R1 = c(28L,
27L, 30L, 26L, 30L, 34L, 25L, 26L, 20L), R2 = c(29L, 24L, 30L,
30L, 30L, 34L, 25L, 26L, 20L), R3 = c(59L, 50L, 61L, 60L, 65L,
61L, 49L, 55L, 60L), R4 = c(55L, 50L, 50L, 60L, 65L, 61L, 48L,
48L, 60L)), .Names = c("Experiment", "Subject", "R1", "R2", "R3",
"R4"), class = "data.frame", row.names = c(NA, -9L))
df2 <- structure(list(Subject = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L), Experiment = c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L), R = c(3L, 3L, 3L, 3L, 3L, 4L, 4L,
4L, 4L, 4L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 4L, 4L, 4L,
4L, 4L, 3L, 3L, 3L, 3L, 3L)), .Names = c("Subject", "Experiment",
"R"), class = "data.frame", row.names = c(NA, -30L))
maybe like this ?
library(reshape)
df<- data.frame(Experiment=c(1,1),Subject=c(1,3),R1=c(28,27),R2=c(29,24),R3=c(59,50),R4=c(55,50))
> df
Experiment Subject R1 R2 R3 R4
1 1 1 28 29 59 55
2 1 3 27 24 50 50
dfc <- melt(df,id=c("Experiment","Subject"))
dfc # New Data
> dfc
Experiment Subject variable value
1 1 1 R1 28
2 1 3 R1 27
3 1 1 R2 29
4 1 3 R2 24
5 1 1 R3 59
6 1 3 R3 50
7 1 1 R4 55
8 1 3 R4 50
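The lookup described in the question can also be done in base R without reshaping. This is only a sketch, assuming the df1 and df2 from the question; row_idx and col_idx are names introduced here. It finds, for every row of df2, the matching row of df1 and the matching R column, then pulls the values out with matrix indexing:

```r
df1 <- data.frame(
  Experiment = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  Subject    = c(1L, 3L, 5L, 7L, 10L, 2L, 4L, 8L, 9L),
  R1 = c(28L, 27L, 30L, 26L, 30L, 34L, 25L, 26L, 20L),
  R2 = c(29L, 24L, 30L, 30L, 30L, 34L, 25L, 26L, 20L),
  R3 = c(59L, 50L, 61L, 60L, 65L, 61L, 49L, 55L, 60L),
  R4 = c(55L, 50L, 50L, 60L, 65L, 61L, 48L, 48L, 60L)
)
df2 <- data.frame(
  Subject    = rep(c(1L, 2L), times = c(20L, 10L)),
  Experiment = rep(c(1L, 2L), times = c(20L, 10L)),
  R          = rep(c(3L, 4L, 1L, 2L, 4L, 3L), each = 5L)
)

# Row of df1 that matches each df2 row on Experiment and Subject.
row_idx <- match(paste(df2$Experiment, df2$Subject),
                 paste(df1$Experiment, df1$Subject))
# Column of df1 ("R1".."R4") selected by df2$R.
col_idx <- match(paste0("R", df2$R), names(df1))

# A two-column index matrix picks one cell per df2 row.
df2$NewColumn <- as.matrix(df1)[cbind(row_idx, col_idx)]
head(df2)
```

Since the lookup goes through match(), the order of the rows in df2 does not matter, and any extra columns in df2 are left untouched. Rows of df2 without a counterpart in df1 would get NA.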

Finding value in one data.frame and transfering value from other column

I don't know if I will be able to explain it correctly, but what I want to achieve is really simple.
This is the first data.frame. The important value for me is in the first column, "V1".
> dput(Data1)
structure(list(V1 = c(10L, 5L, 3L, 9L, 1L, 2L, 6L, 4L, 8L, 7L
), V2 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "NA", class = "factor"),
V3 = c(18L, 17L, 13L, 20L, 15L, 12L, 16L, 11L, 14L, 19L)), .Names = c("V1",
"V2", "V3"), row.names = c(NA, -10L), class = "data.frame")
Second data.frame:
> dput(Data2)
structure(list(Names = c(9L, 10L, 6L, 4L, 2L, 7L, 5L, 3L, 1L,
8L), Herat = c(30L, 29L, 21L, 25L, 24L, 22L, 28L, 27L, 23L, 26L
), Grobpel = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = "NA", class = "factor"), Hassynch = c(19L, 12L,
15L, 20L, 11L, 13L, 14L, 16L, 18L, 17L)), .Names = c("Names",
"Herat", "Grobpel", "Hassynch"), row.names = c(NA, -10L), class = "data.frame"
)
The values in the first column (V1) of the first data.frame can be found in the first column (Names) of the second one. I would like to copy the matching value from the fourth column (Hassynch) and put it in the second column of the first data.frame.
What is the fastest way to do this?
library(dplyr)
left_join(Data1, Data2, by=c("V1"="Names"))
# V1 V2 V3 Herat Grobpel Hassynch
# 1 10 NA 18 29 NA 12
# 2 5 NA 17 28 NA 14
# 3 3 NA 13 27 NA 16
# 4 9 NA 20 30 NA 19
# 5 1 NA 15 23 NA 18
# 6 2 NA 12 24 NA 11
# 7 6 NA 16 21 NA 15
# 8 4 NA 11 25 NA 20
# 9 8 NA 14 26 NA 17
# 10 7 NA 19 22 NA 13
# if you don't want V2 and V3, you could
left_join(Data1, Data2, by=c("V1"="Names")) %>%
select(-V2, -V3)
# V1 Herat Grobpel Hassynch
# 1 10 29 NA 12
# 2 5 28 NA 14
# 3 3 27 NA 16
# 4 9 30 NA 19
# 5 1 23 NA 18
# 6 2 24 NA 11
# 7 6 21 NA 15
# 8 4 25 NA 20
# 9 8 26 NA 17
# 10 7 22 NA 13
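If the goal is literally to fill the second column of Data1, a base R sketch using match() may be enough. It assumes the Data1 and Data2 from the question, except that V2 and Grobpel are created as plain NA columns here instead of the factors in the dput output:

```r
Data1 <- data.frame(
  V1 = c(10L, 5L, 3L, 9L, 1L, 2L, 6L, 4L, 8L, 7L),
  V2 = NA,
  V3 = c(18L, 17L, 13L, 20L, 15L, 12L, 16L, 11L, 14L, 19L)
)
Data2 <- data.frame(
  Names    = c(9L, 10L, 6L, 4L, 2L, 7L, 5L, 3L, 1L, 8L),
  Herat    = c(30L, 29L, 21L, 25L, 24L, 22L, 28L, 27L, 23L, 26L),
  Grobpel  = NA,
  Hassynch = c(19L, 12L, 15L, 20L, 11L, 13L, 14L, 16L, 18L, 17L)
)

# match() gives, for each V1, the row of Data2 whose Names value is equal to it;
# indexing Hassynch with those positions lines the values up with Data1.
Data1$V2 <- Data2$Hassynch[match(Data1$V1, Data2$Names)]
Data1
```

Values of V1 that do not occur in Data2$Names would end up as NA in V2.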
Here's a toy example that I made some time ago to illustrate merge. left_join from dplyr is also good, and data.table almost certainly has another option.
You can subset your reference dataframe so that it contains only the key variable and value variable so that you don't end up with an unmanageable dataframe.
id<-as.numeric((1:5))
m<-c("a","a","a","","")
n<-c("","","b","b","b")
dfm<-data.frame(cbind(id,m))
head(dfm)
id m
1 1 a
2 2 a
3 3 a
4 4
5 5
dfn<-data.frame(cbind(id,n))
head(dfn)
id n
1 1
2 2
3 3 b
4 4 b
5 5 b
dfm$id<-as.numeric(dfm$id)
dfn$id<-as.numeric(dfn$id)
dfm<-subset(dfm,id<4)
head(dfm)
id m
1 1 a
2 2 a
3 3 a
dfn<-subset(dfn,id!=1 & id!=2)
head(dfn)
id n
3 3 b
4 4 b
5 5 b
df.all<-merge(dfm,dfn,by="id",all=TRUE)
head(df.all)
id m n
1 1 a <NA>
2 2 a <NA>
3 3 a b
4 4 <NA> b
5 5 <NA> b
df.all.m<-merge(dfm,dfn,by="id",all.x=TRUE)
head(df.all.m)
id m n
1 1 a <NA>
2 2 a <NA>
3 3 a b
df.all.n<-merge(dfm,dfn,by="id",all.y=TRUE)
head(df.all.n)
id m n
1 3 a b
2 4 <NA> b
3 5 <NA> b

How to select data that have complete cases of a certain column?

I'm trying to get a data frame (just.samples.with.shoulder.values, say) containing only samples that have non-NA values in the Shoulders column. I've tried to accomplish this using the complete.cases function, but I imagine that I'm doing something wrong syntactically below:
data <- structure(list(Sample = 1:14, Head = c(1L, 0L, NA, 1L, 1L, 1L,
0L, 0L, 1L, 1L, 1L, 1L, 0L, 1L), Shoulders = c(13L, 14L, NA,
18L, 10L, 24L, 53L, NA, 86L, 9L, 65L, 87L, 54L, 36L), Knees = c(1L,
1L, NA, 1L, 1L, 2L, 3L, 2L, 1L, NA, 2L, 3L, 4L, 3L), Toes = c(324L,
5L, NA, NA, 5L, 67L, 785L, 42562L, 554L, 456L, 7L, NA, 54L, NA
)), .Names = c("Sample", "Head", "Shoulders", "Knees", "Toes"
), class = "data.frame", row.names = c(NA, -14L))
just.samples.with.shoulder.values <- data[complete.cases(data[,"Shoulders"])]
print(just.samples.with.shoulder.values)
I would also be interested to know whether some other route (using subset(), say) is a wiser idea. Thanks so much for the help!
You can also use complete.cases, which returns a logical vector that lets you subset the rows of data by Shoulders:
data[complete.cases(data$Shoulders), ]
# Sample Head Shoulders Knees Toes
# 1 1 1 13 1 324
# 2 2 0 14 1 5
# 4 4 1 18 1 NA
# 5 5 1 10 1 5
# 6 6 1 24 2 67
# 7 7 0 53 3 785
# 9 9 1 86 1 554
# 10 10 1 9 NA 456
# 11 11 1 65 2 7
# 12 12 1 87 3 NA
# 13 13 0 54 4 54
# 14 14 1 36 3 NA
You could try using is.na:
data[!is.na(data["Shoulders"]),]
Sample Head Shoulders Knees Toes
1 1 1 13 1 324
2 2 0 14 1 5
4 4 1 18 1 NA
5 5 1 10 1 5
6 6 1 24 2 67
7 7 0 53 3 785
9 9 1 86 1 554
10 10 1 9 NA 456
11 11 1 65 2 7
12 12 1 87 3 NA
13 13 0 54 4 54
14 14 1 36 3 NA
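There is a subtle difference between using is.na and complete.cases.
is.na flags individual NA values in a single vector, whereas complete.cases, when given a whole data frame, flags rows that have no NA in any column. The objective here is only to filter on one variable (Shoulders), so NAs in the other columns, which may belong to otherwise legitimate data points, should be left alone. Applied to the Shoulders column alone, both approaches keep the same rows.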
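A short sketch comparing the options, including the subset() route the question asked about. It assumes the data from the question; the result names (by_is_na, by_cc, by_subset, all_complete) are introduced here for illustration:

```r
data <- data.frame(
  Sample    = 1:14,
  Head      = c(1L, 0L, NA, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 1L),
  Shoulders = c(13L, 14L, NA, 18L, 10L, 24L, 53L, NA, 86L, 9L, 65L, 87L, 54L, 36L),
  Knees     = c(1L, 1L, NA, 1L, 1L, 2L, 3L, 2L, 1L, NA, 2L, 3L, 4L, 3L),
  Toes      = c(324L, 5L, NA, NA, 5L, 67L, 785L, 42562L, 554L, 456L, 7L, NA, 54L, NA)
)

# Three ways to keep rows where Shoulders is not NA; note the comma
# after the row condition, which was missing in the question's attempt.
by_is_na  <- data[!is.na(data$Shoulders), ]
by_cc     <- data[complete.cases(data$Shoulders), ]
by_subset <- subset(data, !is.na(Shoulders))

# For contrast: complete.cases() on the whole data frame also drops
# rows that have an NA in any other column.
all_complete <- data[complete.cases(data), ]

nrow(by_is_na)
nrow(all_complete)
```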
