How to find NAs in groups and create new column for data frame - r

I have a data frame consisting of an "ID" column and a "Diff" column. The ID column is responsible for marking groups of corresponding Diff values.
An example looks like this:
structure(list(ID = c(566, 566, 789, 789, 789, 487, 487, 11,
11, 189, 189), Diff = c(100, 277, 529, 43, NA, 860, 780, 445,
NA, 578, 810)), .Names = c("ID", "Diff"), row.names = c(9L, 10L,
20L, 21L, 22L, 25L, 26L, 51L, 52L, 62L, 63L), class = "data.frame")
My goal is to search each group for NAs in the Diff column and create a new column that has either TRUE or FALSE for each row, depending on whether the corresponding group has an NA in Diff.
I tried
x <- aggregate(Diff ~ ID, data, is.na)
and
y <- aggregate(Diff ~ ID, data, function(x) any(is.na(x)))
The idea was to merge the result depending on ID. However, none of the above created a useful result. I know R can do it … and after searching for quite a while I ask you how :)

You can use plyr and its ddply function:
require(plyr)
ddply(data, .(ID), transform, na_diff = any(is.na(Diff)))
## ID Diff na_diff
## 1 11 445 TRUE
## 2 11 NA TRUE
## 3 189 578 FALSE
## 4 189 810 FALSE
## 5 487 860 FALSE
## 6 487 780 FALSE
## 7 566 100 FALSE
## 8 566 277 FALSE
## 9 789 529 TRUE
## 10 789 43 TRUE
## 11 789 NA TRUE

A very similar solution to @dickoa's, except in base:
do.call(rbind, by(data, data$ID, function(x) transform(x, na_diff = any(is.na(Diff)))))
# ID Diff na_diff
# 11.51 11 445 TRUE
# 11.52 11 NA TRUE
# 189.62 189 578 FALSE
# 189.63 189 810 FALSE
# 487.25 487 860 FALSE
# 487.26 487 780 FALSE
# 566.9 566 100 FALSE
# 566.10 566 277 FALSE
# 789.20 789 529 TRUE
# 789.21 789 43 TRUE
# 789.22 789 NA TRUE
Similarly, you could avoid transform with:
data$na_diff<-with(data,by(Diff,ID,function(x) any(is.na(x)))[as.character(ID)])
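Another base idiom, not in the original answers, is ave(). Because Diff is numeric, the per-group logical result comes back coerced to 0/1, so wrap it in as.logical():

```r
# Rebuild the question's data
data <- structure(list(ID = c(566, 566, 789, 789, 789, 487, 487, 11,
  11, 189, 189), Diff = c(100, 277, 529, 43, NA, 860, 780, 445,
  NA, 578, 810)), class = "data.frame", row.names = c(NA, -11L))

# ave() applies FUN per group and replicates the result over the group's rows;
# assigning a logical into the numeric Diff vector coerces it to 0/1
data$na_diff <- as.logical(ave(data$Diff, data$ID,
                               FUN = function(x) any(is.na(x))))
```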

(You have two viable strategies already, but here is another which may be conceptually easier to follow if you are relatively new to R and aren't familiar with the way plyr works.)
I often need to know how many NAs I have in different variables, so here is a convenience function I use as standard:
sna <- function(x){
  sum(is.na(x))
}
From there, I sometimes use aggregate(), but sometimes I find ?summaryBy in the doBy package to be more convenient. Here's an example:
library(doBy)
z <- summaryBy(Diff~ID, data=my.data, FUN=sna)
z
ID Diff.sna
1 11 1
2 189 0
3 487 0
4 566 0
5 789 1
After this, you just need to use ?merge and convert the count of NAs to a logical to get your final data frame:
my.data <- merge(my.data, z, by="ID")
my.data$Diff.sna <- my.data$Diff.sna>0
my.data
ID Diff Diff.sna
1 11 445 TRUE
2 11 NA TRUE
3 189 578 FALSE
4 189 810 FALSE
5 487 860 FALSE
6 487 780 FALSE
7 566 100 FALSE
8 566 277 FALSE
9 789 529 TRUE
10 789 43 TRUE
11 789 NA TRUE
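Incidentally, the OP's second aggregate attempt was close; the formula method of aggregate drops NA rows by default (na.action = na.omit), which is why y never showed any NAs. Passing na.action = na.pass and merging back gives the desired result (a sketch, not from the original answers):

```r
# Rebuild the question's data
data <- structure(list(ID = c(566, 566, 789, 789, 789, 487, 487, 11,
  11, 189, 189), Diff = c(100, 277, 529, 43, NA, 860, 780, 445,
  NA, 578, 810)), class = "data.frame", row.names = c(NA, -11L))

# na.pass keeps the NA rows so any(is.na(x)) can actually see them
y <- aggregate(Diff ~ ID, data, function(x) any(is.na(x)), na.action = na.pass)
names(y)[2] <- "na_diff"
merged <- merge(data, y, by = "ID")
```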

Related

Adding columns by splitting number, and removing duplicates

I have a dataframe like the following (this is a reduced example, I have many more rows and columns):
CH1 CH2 CH3
1 3434 282 7622
2 4442 6968 8430
3 4128 6947 478
4 6718 6716 3017
5 3735 9171 1128
6 65 4876 4875
7 9305 6944 3309
8 4283 6060 650
9 5588 2285 203
10 205 2345 9225
11 8634 4840 780
12 6383 0 1257
13 4533 7692 3760
14 9363 9846 4697
15 3892 79 4372
16 6130 5312 9651
17 7880 7386 6239
18 8515 8021 2295
19 1356 74 8467
20 9024 8626 4136
I need to create additional columns by splitting the values. For example, value 1356 would have to be split into 6, 56, and 356. I do this in a for loop, splitting by string to keep the leading zeros. So far, so good.
# CREATE ADDITIONAL COLUMNS
library(stringr)  # str_sub() comes from the stringr package
for(col in 1:3) {
  # Create a temporary variable
  temp <- as.character(data[, col])
  # Save the new columns
  for(mod in c(-1, -2, -3)) {
    # Append the last 1, 2, or 3 characters
    temp <- cbind(temp, str_sub(as.character(data[, col]), mod))
  }
  # Merge into the data frame
  data <- cbind(data, temp)
}
My problem is that not all cells have 4 digits: some may have 1, 2 or 3 digits. Therefore, I get repeated values when I split. For example, for 79 I get: 79 (original), 9, 79, 79, 79.
Problem: I need to remove the repeated values. Of course, I could do unique, but that gives me rows of uneven number of columns. I need to fill those missing (i.e. the removed repeated values) with NA. I can only compare this by row.
I checked CJ Yetman's answer here, but they only replace consecutive numbers. I only need to keep unique values.
Reproducible Example: Here is a fiddle with my code working: http://rextester.com/IKMP73407
Expected outcome: For example, for rows 11 & 12 of the example (see the link for the reproducible example), if this is my original:
8634 4 34 634 4840 0 40 840 780 0 80 780
6383 3 83 383 0 0 0 0 1257 7 57 257
I'd like to get this:
8634 4 34 634 4840 0 40 840 780 NA 80 NA
6383 3 83 383 0 NA NA NA 1257 7 57 257
You can use apply():
The data:
data <- structure(list(CH1 = c(3434L, 4442L, 4128L, 6718L, 3735L, 65L,
9305L, 4283L, 5588L, 205L, 8634L, 6383L, 4533L, 9363L, 3892L,
6130L, 7880L, 8515L, 1356L, 9024L), CH2 = c(282L, 6968L, 6947L,
6716L, 9171L, 4876L, 6944L, 6060L, 2285L, 2345L, 4840L, 0L, 7692L,
9846L, 79L, 5312L, 7386L, 8021L, 74L, 8626L), CH3 = c(7622L,
8430L, 478L, 3017L, 1128L, 4875L, 3309L, 650L, 203L, 9225L, 780L,
1257L, 3760L, 4697L, 4372L, 9651L, 6239L, 2295L, 8467L, 4136L
)), .Names = c("CH1", "CH2", "CH3"), row.names = c(NA, 20L), class = "data.frame")
Select row 11 and 12:
data <- data[11:12, ]
Using your code:
# CREATE ADDITIONAL COLUMNS
for(col in 1:3) {
# Create a temporal variable
temp <- data[,col]
# Save the new column
for(mod in c(10, 100, 1000)) {
# Create the column
temp <- cbind(temp, data[, col] %% mod)
}
data <- cbind(data, temp)
}
data[,1:3] <- NULL
The result is:
temp V2 V3 V4 temp V2 V3 V4 temp V2 V3 V4
11 8634 4 34 634 4840 0 40 840 780 0 80 780
12 6383 3 83 383 0 0 0 0 1257 7 57 257
Then go through the data row by row, replace duplicates with NA, and transpose the outcome:
t(apply(data, 1, function(row) {
  row[duplicated(row)] <- NA
  return(row)
}))
The result is:
temp V2 V3 V4 temp V2 V3 V4 temp V2 V3 V4
11 8634 4 34 634 4840 0 40 840 780 NA 80 NA
12 6383 3 83 383 0 NA NA NA 1257 7 57 257
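The duplicated() trick also combines with plain substring() extraction, which keeps leading zeros since it works on strings; a minimal base sketch on two illustrative values:

```r
vals <- c(8634, 79)

# last 1, 2, and 3 characters of each value, as strings (keeps leading zeros)
wide <- cbind(as.character(vals),
              sapply(1:3, function(k)
                substring(as.character(vals), nchar(vals) - k + 1)))

# NA out duplicates within each row, as in the answer above
res <- t(apply(wide, 1, function(row) { row[duplicated(row)] <- NA; row }))
```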

Subsetting to drop rows where df$var=0 produces NA rows where var is NA

I have a data.frame from which I'm attempting to eliminate some observations. I want to drop any row in which out$SUB_AGE is equal to 0. However, when I subset based on that condition, every row that has NA for out$SUB_AGE is turned into a row of NAs. I've provided a dput below; it doesn't actually contain any rows where out$SUB_AGE == 0, but it behaves exactly the same way as the full dataset, which does contain zeroes.
# dput the data
> temp <- dput(droplevels(out[1:12, 1:4]))
structure(list(SUB_ID = c(5998784L, 6805295L, 318926L, 1270965L,
1635543L, 4296301L, 1001498L, 2388387L, 2190957L, 4168048L, 318926L,
4073180L), ORG_ID = c(10861L, 17361L, 10608L, 11099L, 13135L,
14803L, 12359L, 13151L, 13135L, 17252L, 10608L, 17317L), SUB_AGE = c(36,
NA, NA, 40, 60, 50, 52, 61, 56, 62, NA, NA), SUB_SEX = c(NA,
1, 2, 1, 2, 2, 1, 2, 2, NA, 2, 2)), .Names = c("SUB_ID", "ORG_ID",
"SUB_AGE", "SUB_SEX"), row.names = c(107L, 190L, 242L, 331L,
361L, 447L, 455L, 591L, 663L, 664L, 731L, 732L), class = "data.frame")
# table before subsetting
SUB_ID ORG_ID SUB_AGE SUB_SEX
107 5998784 10861 36 NA
190 6805295 17361 NA 1
242 318926 10608 NA 2
331 1270965 11099 40 1
361 1635543 13135 60 2
447 4296301 14803 50 2
455 1001498 12359 52 1
591 2388387 13151 61 2
663 2190957 13135 56 2
664 4168048 17252 62 NA
731 318926 10608 NA 2
732 4073180 17317 NA 2
# code to subset
temp <- temp[temp$SUB_AGE != 0,]
# table after subsetting
SUB_ID ORG_ID SUB_AGE SUB_SEX
107 5998784 10861 36 NA
NA NA NA NA NA
NA.1 NA NA NA NA
331 1270965 11099 40 1
361 1635543 13135 60 2
447 4296301 14803 50 2
455 1001498 12359 52 1
591 2388387 13151 61 2
663 2190957 13135 56 2
664 4168048 17252 62 NA
NA.2 NA NA NA NA
NA.3 NA NA NA NA
I'm sure there's something simple I'm missing here but I racked my brain and apparently couldn't come up with the right combination of keywords to figure it out myself.
To understand the problem, try printing temp$SUB_AGE != 0:
[1] TRUE NA NA TRUE TRUE TRUE TRUE TRUE TRUE TRUE NA NA
You're using this vector to subset temp, but logical subsetting is only well-defined for TRUE/FALSE values; each NA in the index produces a row of NAs. If you want to keep all the rows with NA values, you can add an extra condition:
temp[temp$SUB_AGE != 0 | is.na(temp$SUB_AGE),]
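Conversely, if you would rather drop the NA rows together with the zeros, which() or subset() do that implicitly, since neither keeps rows where the comparison evaluates to NA (a sketch with illustrative data, not from the original answer):

```r
temp <- data.frame(SUB_ID = 1:4, SUB_AGE = c(36, NA, 0, 40))

temp[which(temp$SUB_AGE != 0), ]  # which() returns indices of TRUE only
subset(temp, SUB_AGE != 0)        # subset() treats NA comparisons as non-matches
```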

How many categories are there in a column in a list of data frames?

I have a list of data frames where the index indicates where one family ends and another begins. I would like to know how many categories there are in statepath column in each family.
In the example below I have two families, and I am trying to get a table with the frequency of each statepath category (233, 434, 323, etc.) in each family.
My input:
List <-
'$`1`
Chr Start End Family Statepath
1 187546286 187552094 father 233
3 108028534 108032021 father 434
1 4864403 4878685 mother 323
1 18898657 18904908 mother 322
2 460238 461771 offspring 322
3 108028534 108032021 offspring 434
$`2`
Chr Start End Family Statepath
1 71481449 71532983 father 535
2 74507242 74511395 father 233
2 181864092 181864690 mother 322
1 71481449 71532983 offspring 535
2 181864092 181864690 offspring 322
3 160057791 160113642 offspring 335'
Thus, my expected output Freq_statepath would look like:
Freq_statepath <- 'Statepath Family_1 Family_2
233 1 1
434 2 0
323 1 0
322 2 2
535 0 2
335 0 1'
I think you want something like this:
test <- list(data.frame(Statepath = c(233,434,323,322,322)),data.frame(Statepath = c(434,323,322,322)))
list_tables <- lapply(test, function(x) data.frame(table(x$Statepath)))
final_result <- Reduce(function(...) merge(..., by.x = "Var1", by.y = "Var1", all.x = T, all.y = T), list_tables)
final_result[is.na(final_result)] <- 0
> test
[[1]]
Statepath
1 233
2 434
3 323
4 322
5 322
[[2]]
Statepath
1 434
2 323
3 322
4 322
> final_result
Var1 Freq.x Freq.y
1 233 1 0
2 322 2 2
3 323 1 1
4 434 1 1
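An alternative sketch: bind the list into one data frame with a family id and let table() build the whole cross-tabulation in one step (the Family column is illustrative, not from the original answer):

```r
test <- list(data.frame(Statepath = c(233, 434, 323, 322, 322)),
             data.frame(Statepath = c(434, 323, 322, 322)))

# tag each data frame with its list position, then stack them
stacked <- do.call(rbind, Map(cbind, test, Family = seq_along(test)))

# cross-tabulate statepath categories against families
tab <- table(stacked$Statepath, stacked$Family)
```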

Breaking down a timed sequence into episodes

I'm trying to break down a vector of event times into episodes. An episode must meet 2 criteria. 1) It consists of 3 or more events and 2) those events have inter-event times of 25 time units or less. My data is organized in a data frame as shown below.
So far, I figured out that I can find the difference between events with diff(EventTime). By creating a logical vector that marks events meeting the 2nd inter-event criterion, I can use rle(EpisodeTimeCriterion) to get the total number and lengths of episodes.
EventTime TimeDifferenceBetweenNextEvent EpisodeTimeCriterion
25 NA NA
75 50 TRUE
100 25 TRUE
101 1 TRUE
105 4 TRUE
157 52 FALSE
158 1 TRUE
160 2 TRUE
167 7 TRUE
169 2 TRUE
170 1 TRUE
175 5 TRUE
178 3 TRUE
278 100 FALSE
302 24 TRUE
308 6 TRUE
320 12 TRUE
322 459 FALSE
However, I would like to know the timing of the episodes, and rle() doesn't let me do that.
Ideally I would like to generate a data frame that looks like this:
Episode EventsPerEpisode EpisodeStartTime EpisodeEndTime
1 4 75 105
2 7 158 178
3 3 302 322
I know that this is probably a simple problem but being new to R, the only solution I can envision is some series of loops. Is there a way of doing this without loops? Or is there package that lends itself to this sort of analysis?
Thanks!
Edited for clarity: added a desired outcome data frame and expanded the example data to make it clearer.
You've got the pieces you need. You really need to make a variable that gives each episode a number/name so you can group by it. rle(...)$lengths gives you the run lengths, so just use rep to repeat each run's number that many times:
runs <- rle(df$EpisodeTimeCriterion)$lengths # You don't need this extra variable, but it makes the code more readable
df$Episode <- rep(1:length(runs), runs)
so df looks like
> head(df)
EventTime TimeDifferenceBetweenNextEvent EpisodeTimeCriterion Episode
1 25 NA NA 1
2 75 50 TRUE 2
3 100 25 TRUE 2
4 101 1 TRUE 2
5 105 4 TRUE 2
6 157 52 FALSE 3
Now use dplyr to summarize the data:
library(dplyr)
df2 <- df %>% filter(EpisodeTimeCriterion) %>% group_by(Episode) %>%
summarise(EventsPerEpisode = n(),
EpisodeStartTime = min(EventTime),
EpisodeEndTime = max(EventTime))
which returns
> df2
Source: local data frame [3 x 4]
Episode EventsPerEpisode EpisodeStartTime EpisodeEndTime
(int) (int) (dbl) (dbl)
1 2 4 75 105
2 4 7 158 178
3 6 3 302 320
If you want your episode numbers to be integers starting with one, you can clean up with
df2$Episode <- 1:nrow(df2)
Data
If someone wants to play with the data, the results of dput(df) before running the above code:
df <- structure(list(EventTime = c(25, 75, 100, 101, 105, 157, 158,
160, 167, 169, 170, 175, 178, 278, 302, 308, 320, 322), TimeDifferenceBetweenNextEvent = c(NA,
50, 25, 1, 4, 52, 1, 2, 7, 2, 1, 5, 3, 100, 24, 6, 12, 459),
EpisodeTimeCriterion = c(NA, TRUE, TRUE, TRUE, TRUE, FALSE,
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE,
TRUE, FALSE)), .Names = c("EventTime", "TimeDifferenceBetweenNextEvent",
"EpisodeTimeCriterion"), row.names = c(NA, -18L), class = "data.frame")
Here is one approach I came up with using a combination of cut2 from Hmisc package and cumsum to label episodes into numbers:
library(Hmisc)
library(dplyr)
df$episodeCut <- cut2(df$TimeDifferenceBetweenNextEvent, c(26))
df$episode <- cumsum((df$episodeCut == '[ 1,26)' & lag(df$episodeCut) != '[ 1,26)') | df$episodeCut != '[ 1,26)')
Output is as follows:
EventTime TimeDifferenceBetweenNextEvent EpisodeTimeCriterion episodeCut episode
1 25 50 FALSE [26,52] 1
2 75 25 TRUE [ 1,26) 2
3 100 1 TRUE [ 1,26) 2
4 101 4 TRUE [ 1,26) 2
5 105 52 TRUE [26,52] 3
6 157 52 FALSE [26,52] 4
As you can see, it tags rows 2, 3, 4 as belonging to a single episode.
Is this what you are looking for? Not sure from your description. So, my answer may be wrong.

Apply a rule to calculate sum of specific

Hi, I have a data set like this.
Num C Pr Value Volume
111 aa Alen 111 222
111 aa Paul 100 200
222 vv Iva 444 555
222 vv John 333 444
I would like to filter the data according to Num and add a new row that takes the sum of the Value and Volume columns, keeps the information from the Num and C columns, and puts "Total" in the Pr column. It should look this way:
Num C Pr Value Volume
222 vv Total 777 999
Could you suggest how to do it? I would like it only for Num 222.
When I try to use the res command I end up with this result.
# Num C Pr Value Volume
1: 111 aa Alen 111 222
2: 111 aa Paul 100 200
3: 111 aa Total NA NA
4: 222 vv Iva 444 555
5: 222 vv John 333 444
6: 222 vv Total NA NA
What causes this?
The structure of my data is the following:
'data.frame': 4 obs. of 5 variables:
$ Num : Factor w/ 2 levels "111","222": 1 1 2 2
$ C : Factor w/ 2 levels "aa","vv": 1 1 2 2
$ Pr : Factor w/ 4 levels "Alen","Iva","John",..: 1 4 2 3
$ Value : Factor w/ 4 levels "100","111","333",..: 2 1 4 3
$ Volume: Factor w/ 4 levels "200","222","444",..: 2 1 4 3
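The NA totals likely come from Value and Volume being factors, as the str() output above shows: factors store level codes, so arithmetic on them is not meaningful. Converting via character first recovers the numbers (a sketch):

```r
# factors store integer level codes, so go through character before arithmetic
f <- factor(c("111", "100", "444", "333"))
as.numeric(as.character(f))       # the original numbers back
sum(as.numeric(as.character(f)))  # a meaningful total
```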
We could use data.table. We convert the 'data.frame' to a 'data.table' (setDT(df1)), group by the 'Num' and 'C' columns, specify the columns to sum in .SDcols, loop over those columns with lapply to get the sum, and create the 'Pr' column. We can then rbind the original dataset with the new summarised output ('DT1') and order the result by 'Num'.
library(data.table)#v1.9.5+
DT1 <- setDT(df1)[,lapply(.SD, sum) , by = .(Num,C),
.SDcols=Value:Volume][,Pr:='Total'][]
rbind(df1, DT1)[order(Num)]
# Num C Pr Value Volume
#1: 111 aa Alen 111 222
#2: 111 aa Paul 100 200
#3: 111 aa Total 211 422
#4: 222 vv Iva 444 555
#5: 222 vv John 333 444
#6: 222 vv Total 777 999
This can be done using base R methods as well. We get the sum of 'Value', 'Volume' columns grouped by 'Num', 'C', using the formula method of aggregate, transform the output by creating the 'Pr' column, rbind with original dataset and order the output ('res') based on 'Num'.
res <- rbind(df1,transform(aggregate(.~Num+C, df1[-3], FUN=sum), Pr='Total'))
res[order(res$Num),]
# Num C Pr Value Volume
#1 111 aa Alen 111 222
#2 111 aa Paul 100 200
#5 111 aa Total 211 422
#3 222 vv Iva 444 555
#4 222 vv John 333 444
#6 222 vv Total 777 999
EDIT: Noticed that the OP mentioned filter. If this is for a single 'Num', we subset the data, and then do the aggregate, transform steps.
transform(aggregate(.~Num+C, subset(df1, Num==222)[-3], FUN=sum), Pr='Total')
# Num C Value Volume Pr
#1 222 vv 777 999 Total
Or we may not need aggregate. After subsetting the data, we convert 'Num' to a 'factor', loop through the output dataset ('df2'), get the sum if the column is numeric or else take the first element, and wrap with data.frame.
df2 <- transform(subset(df1, Num==222), Num=factor(Num))
data.frame(c(lapply(df2[-3], function(x) if(is.numeric(x))
sum(x) else x[1]), Pr='Total'))
# Num C Value Volume Pr
#1 222 vv 777 999 Total
data
df1 <- structure(list(Num = c(111L, 111L, 222L, 222L), C = c("aa", "aa",
"vv", "vv"), Pr = c("Alen", "Paul", "Iva", "John"), Value = c(111L,
100L, 444L, 333L), Volume = c(222L, 200L, 555L, 444L)), .Names = c("Num",
"C", "Pr", "Value", "Volume"), class = "data.frame",
row.names = c(NA, -4L))
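For a single group, plain colSums on the subset is also enough (a base sketch built on df1 above):

```r
df1 <- data.frame(Num = c(111L, 111L, 222L, 222L),
                  C = c("aa", "aa", "vv", "vv"),
                  Pr = c("Alen", "Paul", "Iva", "John"),
                  Value = c(111L, 100L, 444L, 333L),
                  Volume = c(222L, 200L, 555L, 444L),
                  stringsAsFactors = FALSE)

# sum the numeric columns for the one group of interest
tot <- colSums(df1[df1$Num == 222, c("Value", "Volume")])
res <- rbind(df1, data.frame(Num = 222L, C = "vv", Pr = "Total",
                             Value = tot[["Value"]], Volume = tot[["Volume"]]))
```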
Or using dplyr:
library(dplyr)
df1 %>%
filter(Num == 222) %>%
summarise(Value = sum(Value),
Volume = sum(Volume),
Pr = 'Total',
Num = Num[1],
C = C[1])
# Value Volume Pr Num C
# 1 777 999 Total 222 vv
where we first filter to keep only Num == 222, and then use summarise to obtain the sums and the values for Num and C. This assumes that:
You do not want to get the result for each unique Num (I select one here, you could select multiple). If you need this, use group_by.
There is only ever one C for every unique Num.
You can also use the dplyr package:
df %>%
filter(Num == 222) %>%
group_by(Num, C) %>%
summarise(
Pr = "Total"
, Value = sum(Value)
, Volume = sum(Volume)
) %>%
rbind(df, .)
# Num C Pr Value Volume
# 1 111 aa Alen 111 222
# 2 111 aa Paul 100 200
# 3 222 vv Iva 444 555
# 4 222 vv John 333 444
# 5 222 vv Total 777 999
If you want the total for each Num value, just comment out the filter line.