R - ddply summarise using nlevels() does not work

When using the plyr package to summarise my data, it seems impossible to use the nlevels() function.
The structure of my data set is as follows:
aer <- read.xlsx("XXXX.xlsx", sheetIndex=1)
aer$ID <- as.factor(aer$ID)
aer$description <- as.factor(aer$description)
head(aer)
ID SOC start end days count severity relation
1 1 410 2015-04-21 2015-04-28 7 1 1 3
2 1 500 2015-01-30 2015-05-04 94 1 1 3
3 1 600 2014-11-25 2014-11-29 4 1 1 3
4 1 600 2015-01-02 2015-01-07 5 1 1 3
5 1 600 2015-01-26 2015-03-02 35 1 1 3
6 1 600 2015-04-14 2015-04-17 3 1 1 3
> dput(head(aer,4))
structure(list(ID = structure(c(1L, 1L, 1L, 1L), .Label = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "12", "13", "14",
"15"), class = "factor"), SOC = c(410, 500, 600, 600),
start = structure(c(16546, 16465, 16399, 16437), class = "Date"),
end = structure(c(16553, 16559, 16403, 16442), class = "Date"),
days = c(7, 94, 4, 5), count = c(1, 1, 1, 1), severity = c(1,
1, 1, 1), relation = c(3, 3, 3, 3)), .Names = c("ID", "SOC",
"description", "start", "end", "days", "count", "severity", "relation"
), row.names = c(NA, 4L), class = "data.frame")
What I would like to know is how many levels of the "ID" variable exist in each subset created by splitting the data set on the variable "SOC". I want to summarise this information together with some other variables in a new data set. Therefore, I would like to use the plyr package like so:
summaer2 <- ddply(aer, c("SOC"), summarise,
participants = nlevels(ID),
events = sum(count),
min_duration = min(days),
max_duration = max(days),
max_severity = max(severity))
This returns the following error:
Error in Summary.factor(c(4L, 5L, 11L, 11L, 14L, 14L), na.rm = FALSE) :
‘max’ not meaningful for factors
Could someone give me advice on how to reach my goal? Or what I'm doing wrong?
Many thanks in advance!

Update:
Substituting nlevels(ID) with length(unique(ID)) seems to give me the desired output:
> head(summaer2)
SOC participants events min_duration max_duration max_severity
1 100 4 7 1 62 2
2 410 9 16 1 41 2
3 431 2 2 109 132 1
4 500 5 9 23 125 2
5 600 8 19 1 35 1
6 1040 1 1 98 98 2
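
One likely reason the two expressions differ: subsetting a factor keeps every declared level, so nlevels() on a subset still reports the level count of the whole column, while length(unique()) counts only the values present. The sketch below uses a small made-up data frame (the values are illustrative, not the asker's data) and base R only:

```r
# Toy data: three participants overall, two of them in SOC 410 (made-up values)
aer <- data.frame(ID  = factor(c("1", "1", "2", "3")),
                  SOC = c(410, 410, 410, 500))

ids_410 <- aer$ID[aer$SOC == 410]

n_declared <- nlevels(ids_410)              # counts all levels of the full factor
n_present  <- length(unique(ids_410))       # counts only IDs present in the subset
n_dropped  <- nlevels(droplevels(ids_410))  # same, after dropping unused levels

# Participants per SOC, computed group by group
participants <- tapply(aer$ID, aer$SOC, function(x) length(unique(x)))
```

Inside ddply(), nlevels(droplevels(ID)) should therefore behave like length(unique(ID)), assuming ID contains no NA values.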

Related

Can't remove NAs

I am trying to delete the rows in my dataset that contain NAs, but none of the functions work. What could be the reason?
Here is a sample of my code:
Site_cov<- read.csv("site_cov.csv")
colnames(Site_cov)<- c("Point", "Basal", "Short.Saps", "Tall.Saps")
head(Site_cov)
Point Basal Short.Saps Tall.Saps
1 DEL001 Na 2 0
2 DEL002 Na 1 6
3 DEL003 Na 0 5
4 DEL004 10 21 22
I thought that the mix of upper- and lower-case NAs could be the problem, so this is what I ran:
Site_cov$Basal<-toupper(Site_cov$Basal)
Site_cov$Short.Saps<-toupper(Site_cov$Short.Saps)
Site_cov$Tall.Saps<-toupper(Site_cov$Tall.Saps)
Then I tried to delete the NAs:
Site_cov_NA <- Site_cov[complete.cases(Site_cov[ , c("Point", "Basal", "Short.Saps", "Tall.Saps")]), ]
But the NAs are still there:
head(Site_cov_NA)
Point Basal Short.Saps Tall.Saps
1 DEL001 NA 2 0
2 DEL002 NA 1 6
3 DEL003 NA 0 5
4 DEL004 10 21 22
5 DEL005 60 8 17
6 DEL006 80 17 13
You have 'Na' strings, which are not real NAs; toupper() only turns them into the string "NA", which complete.cases() does not treat as missing. Replace them with real NA values and your code should work.
dat <- replace(dat, dat == 'Na', NA)
dat[complete.cases(dat[, c("Point", "Basal", "Short.Saps", "Tall.Saps")]), ]
# Point Basal Short.Saps Tall.Saps
# 4 DEL004 10 21 22
Data:
dat <- structure(list(Point = c("DEL001", "DEL002", "DEL003", "DEL004"
), Basal = c("Na", "Na", "Na", "10"), Short.Saps = c(2L, 1L,
0L, 21L), Tall.Saps = c(0L, 6L, 5L, 22L)), class = "data.frame", row.names = c("1",
"2", "3", "4"))
Try the complete.cases() function (https://stat.ethz.ch/R-manual/R-patched/library/stats/html/complete.cases.html)
try <- data.frame("a"=c(1,3,NA,NA), "b"=c(3,5,2,3))
try1<-try[complete.cases(try),]
try1
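
Another option is to declare the fake missing-value strings at import time through the na.strings argument of read.csv(). This is a minimal sketch: the CSV content is inlined with text= only to keep the example self-contained, and it assumes the file really has the four columns shown in the question.

```r
# Inline stand-in for site_cov.csv (assumed layout, taken from the question's head() output)
csv_text <- "Point,Basal,Short.Saps,Tall.Saps
DEL001,Na,2,0
DEL002,Na,1,6
DEL003,Na,0,5
DEL004,10,21,22"

# Treat both "NA" and "Na" as missing while reading
site_cov <- read.csv(text = csv_text, na.strings = c("NA", "Na"))

# Basal is now read as a number with real NAs, so complete.cases() can filter on it
site_cov_clean <- site_cov[complete.cases(site_cov), ]
```

With the real file, the same call would take the file name in place of text = csv_text.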

How to replace all variable names with the contents of the first row in a tibble

Is there a quick way to replace variable names with the content of the first row of a tibble?
So turning something like this:
Subject Q1 Q2 Q3
Subject age gender cue
429753 24 1 man
b952x8 23 2 mushroom
264062 19 1 night
53082m 35 1 moon
Into this:
Subject age gender cue
429753 24 1 man
b952x8 23 2 mushroom
264062 19 1 night
53082m 35 1 moon
My dataset has over 100 variables so I'm looking for a way that doesn't involve typing out each old and new variable name.
A possible solution:
df <- structure(list(Subject = c("Subject", "429753", "b952x8", "264062",
"53082m"), Q1 = c("age", "24", "23", "19", "35"), Q2 = c("gender",
"1", "2", "1", "1"), Q3 = c("cue", "man", "mushroom", "night",
"moon")), row.names = c(NA, -5L), class = "data.frame")
names(df) <- df[1,]
df <- df[-1,]
df
#> Subject age gender cue
#> 2 429753 24 1 man
#> 3 b952x8 23 2 mushroom
#> 4 264062 19 1 night
#> 5 53082m 35 1 moon
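
A slightly more defensive variant of the same idea, as a sketch: it flattens the first row to a plain character vector before using it as names, resets the row names, and then lets type.convert() re-guess the column types, since every column was read as character while the header row was still in the data. The data frame here is a shortened copy of the example above.

```r
df <- data.frame(Subject = c("Subject", "429753", "b952x8"),
                 Q1 = c("age", "24", "23"),
                 Q2 = c("gender", "1", "2"),
                 Q3 = c("cue", "man", "mushroom"),
                 stringsAsFactors = FALSE)

# Use the first row as column names, then drop it
names(df) <- as.character(unlist(df[1, ]))
df <- df[-1, ]
rownames(df) <- NULL

# Re-infer column types; as.is = TRUE keeps text columns as character
df[] <- lapply(df, type.convert, as.is = TRUE)
str(df)
```

This works for any number of columns, so nothing has to be typed out per variable.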

Calculate value using row with pattern

I have these inputs:
df
# A tibble: 53 x 2
Task Frame
<chr> <int>
1 S101-10061 6
2 S101-10061-74716 16
3 S101-10065 18
4 S101-10065-104934 16
5 S101-10071 32
6 S101-10071-104898 74
7 S101-10072 8
8 S101-10072-79124 58
9 S101-10074 38
10 S101-10075 82
As you can see, the first 10 characters of "Task" are sometimes the same. I need to detect when two tasks match on those first 10 characters, for example task 1 (S101-10061) and task 2 (S101-10061-74716), and, when they do, compute the absolute difference between their Frame values; in this example, 16 - 6 = 10. So I expect something like:
Task Frame Diff
<chr> <int> <int>
1 S101-10061 6 10
2 S101-10061-74716 16
3 S101-10065 18 2
4 S101-10065-104934 16
5 S101-10071 32 42
6 S101-10071-104898 74
7 S101-10072 8 50
8 S101-10072-79124 58
9 S101-10074 38
10 S101-10075 82
I tried:
df %>% mutate(
Diff = accumulate(
Frame[1:n()], function(x,y)(abs(x-y))
)
)
But it doesn't help. How can I compare rows by pattern? Any ideas?
Here is a dplyr solution.
library(dplyr)
df %>%
mutate(Task = substr(Task, 1, 10)) %>%
group_by(Task) %>%
mutate(Diff = abs(Frame - lead(Frame)))
Data.
df <-
structure(list(Task = c("S101-10061", "S101-10061-74716", "S101-10065",
"S101-10065-104934", "S101-10071", "S101-10071-104898", "S101-10072",
"S101-10072-79124", "S101-10074", "S101-10075"), Frame = c(6L,
16L, 18L, 16L, 32L, 74L, 8L, 58L, 38L, 82L)), class = "data.frame",
row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"))
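
Note that the dplyr code above overwrites Task with its 10-character prefix. If the full task names should be kept, the same grouping can be sketched in base R with ave(); the key vector is a helper introduced here for illustration, and the data is the example data from above.

```r
df <- data.frame(
  Task = c("S101-10061", "S101-10061-74716", "S101-10065",
           "S101-10065-104934", "S101-10071", "S101-10071-104898",
           "S101-10072", "S101-10072-79124", "S101-10074", "S101-10075"),
  Frame = c(6L, 16L, 18L, 16L, 32L, 74L, 8L, 58L, 38L, 82L),
  stringsAsFactors = FALSE)

# Grouping key: the first 10 characters of Task (Task itself is left untouched)
key <- substr(df$Task, 1, 10)

# Within each key, absolute difference between a row and the next row;
# the last row of each group (and any single-row group) gets NA
df$Diff <- ave(df$Frame, key, FUN = function(x) abs(x - c(x[-1], NA)))
df
```

The function passed to ave() plays the role of abs(Frame - lead(Frame)) in the dplyr version.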

Create new column using condition on another existing column

I have data like this
Time chamber
9 1
10 2
11 3
12 4
13 5
14 6
15 7
16 8
17 9
18 10
19 11
20 12
21 1
22 2
23 3
24 4
I want to create a new column using conditions on another existing column (chamber).
It should look something like this
Time chamber treatment
9 1 c2t2
10 2 c2t2
11 3 c0t0r
12 4 c2t2r
13 5 c2t2r
14 6 c0t0
15 7 c0t0r
16 8 c0t0r
17 9 c2t2
18 10 c2t2r
19 11 c0t0
20 12 c0t0
21 1 c2t2
22 2 c2t2
23 3 c0t0r
24 4 c2t2r
For chambers 1,2,9: Treatment is c2t2
For chambers 3,7,8: Treatment is c0t0r.
For chambers 4,5,10: Treatment is c2t2r
For chambers 6,11,12: Treatment is c0t0.
I have also made a lookup table, but I don't know how to use it:
lookup_table <- data.frame(row.names = c("1", "2", "3","4", "5", "6","7", "8", "9","10", "11", "12"),
new_col = c("C2T2", "C2T2", "C0T0R","C2T2R", "C2T2R", "C0T0","C0T0R", "C0T0R", "C2T2","C2T2R", "C0T0", "C0T0"),
stringsAsFactors = FALSE)
Assuming "dt" is the name of your data frame, you can use dplyr with case_when:
library(tidyverse)
dt %>%
# refer to the column directly; inside mutate() there is no need for dt$
mutate(newcol = case_when(chamber %in% c(1, 2, 9) ~ "c2t2",
chamber %in% c(3, 7, 8) ~ "c0t0r",
chamber %in% c(4, 5, 10) ~ "c2t2r",
chamber %in% c(6, 11, 12) ~ "c0t0"))
Output:
Time chamber newcol
1 9 1 c2t2
2 10 2 c2t2
3 11 3 c0t0r
4 12 4 c2t2r
5 13 5 c2t2r
6 14 6 c0t0
7 15 7 c0t0r
8 16 8 c0t0r
9 17 9 c2t2
10 18 10 c2t2r
11 19 11 c0t0
12 20 12 c0t0
13 21 1 c2t2
14 22 2 c2t2
15 23 3 c0t0r
16 24 4 c2t2r
You can merge your df with the lookup_table. In my experience, if you want to combine different data.frames, merge() is the command I like to use. Do note that there are many different ways and specialised packages you can use for the same purpose!
You need to specify which column you use as the 'matching column' and also that you want to keep all records in df:
merge(df, lookup_table, all.x = TRUE, by.x = "chamber", by.y = "row.names")
Data:
df <- structure(list(Time = 9:24, chamber = c(1L, 2L, 3L, 4L, 5L, 6L,
7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L)),
.Names = c("Time", "chamber"), class = "data.frame",
row.names = c(NA, -16L))
lookup_table <- structure(list(new_col = c("C2T2", "C2T2", "C0T0R", "C2T2R",
"C2T2R", "C0T0", "C0T0R", "C0T0R",
"C2T2", "C2T2R", "C0T0", "C0T0")),
.Names = "new_col",
row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"), class = "data.frame")
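
One thing to keep in mind is that merge() sorts its result by the key column, so the rows come back ordered by chamber rather than by Time. If the original row order matters, the lookup table can also be indexed directly by row name. This is a sketch using the same example data; the column name treatment and the tolower() call (to match the lower-case labels in the desired output) are choices made here, not part of the answers above.

```r
df <- data.frame(Time = 9:24,
                 chamber = c(1:12, 1:4))

lookup_table <- data.frame(
  new_col = c("C2T2", "C2T2", "C0T0R", "C2T2R", "C2T2R", "C0T0",
              "C0T0R", "C0T0R", "C2T2", "C2T2R", "C0T0", "C0T0"),
  row.names = as.character(1:12),
  stringsAsFactors = FALSE)

# Index the lookup table by row name; the order of df is preserved
df$treatment <- tolower(lookup_table[as.character(df$chamber), "new_col"])
head(df)
```

A chamber that is missing from the lookup table would simply get NA.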

Concatenate/paste strings of varying length according to index

I would like to know how to concatenate strings into sequences of varying length and content, according to one condition.
Here is an example data frame (my real one has about 60000 rows):
index: just an index
to_concat: the string item I want to concatenate
max_seq: one example of the condition for concatenation (to_concat should only be concatenated with items that are part of the same sequence; for now I have indicated the position of each string within its sequence)
concat_result: the result I would like to have
index to_concat max_seq concat_result
1 Abc! 1 <abc!+def+_>
2 def 2 <abc!+def+_>
3 _ 3 <abc!+def+_>
4 x93 1 <x93+afza+5609+5609+Abc!+def>
5 afza 2 <x93+afza+5609+5609+Abc!+def>
6 5609 3 <x93+afza+5609+5609+Abc!+def>
7 5609 4 <x93+afza+5609+5609+Abc!+def>
8 Abc! 5 <x93+afza+5609+5609+Abc!+def>
9 def 6 <x93+afza+5609+5609+Abc!+def>
10 _ 1 <_+x93+afza>
11 x93 2 <_+x93+afza>
12 afza 3 <_+x93+afza>
I know of paste, aggregate and length, which are probably useful, but I do not see in which order to apply them, and especially how to formulate the paste.
I suppose a second index would work better than max_seq, one where all strings to be concatenated in the same sequence share the same number; here we would have 3 sequences: "1 1 1 2 2 2 2 2 2 3 3 3".
But I do not know whether that is the quickest or easiest solution, and I also do not know how to paste sequences of varying length.
Could you please help a fellow PhD? Thanks a lot in advance.
Reproducible example:
> dput(dat)
structure(list(V1 = c("index", "1", "2", "3", "4", "5", "6",
"7", "8", "9", "10", "11", "12"), V2 = c("to_concat", "Abc!",
"def", "_", "x93", "afza", "5609", "5609", "Abc!", "def", "_",
"x93", "afza"), V3 = c("max_seq", "1", "2", "3", "1", "2", "3",
"4", "5", "6", "1", "2", "3"), V4 = c("concat_result", "<abc!+def+_>",
"<abc!+def+_>", "<abc!+def+_>", "<x93+afza+5609+5609+Abc!+def>",
"<x93+afza+5609+5609+Abc!+def>", "<x93+afza+5609+5609+Abc!+def>",
"<x93+afza+5609+5609+Abc!+def>", "<x93+afza+5609+5609+Abc!+def>",
"<x93+afza+5609+5609+Abc!+def>", "<_+x93+afza>", "<_+x93+afza>",
"<_+x93+afza>")), .Names = c("V1", "V2", "V3", "V4"), class = "data.frame", row.names = c(NA,
-13L))
Several options to get the desired result:
1) Using base R:
mydf$grp <- cumsum(mydf$max_seq < c(1,head(mydf$max_seq, -1))) + 1
mydf$concat_result <- ave(mydf$to_concat, mydf$grp,
FUN = function(x) paste0('<',paste(x,collapse='+'),'>'))
which gives:
> mydf
index to_concat max_seq grp concat_result
1 1 Abc! 1 1 <Abc!+def+_>
2 2 def 2 1 <Abc!+def+_>
3 3 _ 3 1 <Abc!+def+_>
4 4 x93 1 2 <x93+afza+5609+5609+Abc!+def>
5 5 afza 2 2 <x93+afza+5609+5609+Abc!+def>
6 6 5609 3 2 <x93+afza+5609+5609+Abc!+def>
7 7 5609 4 2 <x93+afza+5609+5609+Abc!+def>
8 8 Abc! 5 2 <x93+afza+5609+5609+Abc!+def>
9 9 def 6 2 <x93+afza+5609+5609+Abc!+def>
10 10 _ 1 3 <_+x93+afza>
11 11 x93 2 3 <_+x93+afza>
12 12 afza 3 3 <_+x93+afza>
2) Or using the data.table package:
library(data.table)
setDT(mydf)[, grp := cumsum(max_seq < shift(max_seq, fill = 0))+1
][, concat_result := paste0('<',paste(to_concat,collapse='+'),'>'), grp][]
3) Or using the dplyr package:
library(dplyr)
mydf %>%
mutate(grp = cumsum(max_seq < lag(max_seq, n=1, default=0))+1) %>%
group_by(grp) %>%
mutate(concat_result = paste0('<',paste(to_concat,collapse='+'),'>'))
Used data:
mydf <- structure(list(index = 1:12,
to_concat = c("Abc!", "def", "_", "x93", "afza", "5609", "5609", "Abc!", "def", "_", "x93", "afza"),
max_seq = c(1L, 2L, 3L, 1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L)),
.Names = c("index", "to_concat", "max_seq"), class = "data.frame", row.names = c(NA, -12L))
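
If only one string per sequence is needed, rather than the result repeated on every row, the same grouping idea can be sketched in base R with tapply(). This variant assumes every sequence starts with max_seq equal to 1, which holds in the example data but is an assumption about the real data.

```r
mydf <- data.frame(
  index = 1:12,
  to_concat = c("Abc!", "def", "_", "x93", "afza", "5609",
                "5609", "Abc!", "def", "_", "x93", "afza"),
  max_seq = c(1L, 2L, 3L, 1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L),
  stringsAsFactors = FALSE)

# A new sequence starts wherever max_seq is 1 (assumption, see above)
grp <- cumsum(mydf$max_seq == 1L)

# One concatenated string per sequence
per_seq <- tapply(mydf$to_concat, grp,
                  function(x) paste0("<", paste(x, collapse = "+"), ">"))
per_seq
```

Because paste(..., collapse = "+") works on a vector of any length, sequences of different lengths need no special handling.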
