I am dealing with a dataset like this
Id Value Date
1 250 NA
1 250 2010-06-21
2 6 NA
2 6 2012-08-23
3 545 NA
7 3310 NA
My goal is to remove an entire row if its Date is NA and its Id is duplicated. The final output should look like:
Id Value Date
1 250 2010-06-21
2 6 2012-08-23
3 545 NA
7 3310 NA
# & binds tighter than |, so the two duplicated() calls are grouped explicitly
df1[!(is.na(df1$Date) & (duplicated(df1$Id) | duplicated(df1$Id, fromLast = TRUE))),]
# Id Value Date
#2 1 250 2010-06-21
#4 2 6 2012-08-23
#5 3 545 <NA>
#6 7 3310 <NA>
DATA
df1 = structure(list(Id = c(1L, 1L, 2L, 2L, 3L, 7L), Value = c(250L,
250L, 6L, 6L, 545L, 3310L), Date = c(NA, "2010-06-21", NA, "2012-08-23",
NA, NA)), .Names = c("Id", "Value", "Date"), class = "data.frame", row.names = c(NA,
-6L))
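A minimal sketch of the same filter wrapped in a hypothetical helper, `drop_na_dups` (the name and the small test frame are illustrative, not from the question). Because `&` binds more tightly than `|` in R, the two `duplicated()` calls are combined first and only then intersected with the NA test, so the result should not depend on whether the NA row comes before or after the dated row.

```r
# Hypothetical helper: drop rows whose Date is NA when the Id occurs more than once
drop_na_dups <- function(d) {
  # TRUE for every row whose Id appears at least twice, whichever copy it is
  dup_id <- duplicated(d$Id) | duplicated(d$Id, fromLast = TRUE)
  d[!(is.na(d$Date) & dup_id), ]
}

# Illustrative data: for Id 1 the NA row comes AFTER the dated row
df2 <- data.frame(Id = c(1L, 1L, 3L),
                  Value = c(250L, 250L, 545L),
                  Date = c("2010-06-21", NA, NA))
res <- drop_na_dups(df2)
```

Id 3 is kept even though its Date is NA, because that Id is not duplicated.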
I have data that looks like this, but much bigger:
df<- structure(list(names = c("bests-1", "trible-1", "crazy-1", "cool-1",
"nonsense-1", "Mean-1", "Lose-1", "Trye-1", "Trified-1"), Col = c(1L,
2L, NA, 4L, 47L, 294L, 2L, 1L, 3L), col2 = c(2L, 4L, 5L, 7L,
9L, 9L, 0L, 2L, 3L)), class = "data.frame", row.names = c(NA,
-9L))
As an example, I am trying to remove -1 from all strings in the first column.
I can do this with
as.data.frame(str_remove_all(df$names, "-1"))
The problem is that this returns only that one column and drops all the others.
I don't want to split the data and merge it again because I am afraid of introducing a mismatch.
Is there any way to just get rid of specific strings without touching the rest of the data?
For instance, the output should look like this:
names Col col2
bests 1 2
trible 2 4
crazy NA 5
cool 4 7
nonsense 47 9
Mean 294 9
Lose 2 0
Trye 1 2
Trified 3 3
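Using gsub, anchor the pattern with $ so that only a -1 at the end of the string is removed. The hyphen is escaped here (\\-), although outside a character class it does not need to be.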
transform(df, names=gsub('\\-1$', '', names))
# names Col col2
# 1 bests 1 2
# 2 trible 2 4
# 3 crazy NA 5
# 4 cool 4 7
# 5 nonsense 47 9
# 6 Mean 294 9
# 7 Lose 2 0
# 8 Trye 1 2
# 9 Trified 3 3
Data:
df <- structure(list(names = c("bests-1", "trible-1", "crazy-1", "cool-1",
"nonsense-1", "Mean-1", "Lose-1", "Trye-1", "Trified-1"), Col = c(1L,
2L, NA, 4L, 47L, 294L, 2L, 1L, 3L), col2 = c(2L, 4L, 5L, 7L,
9L, 9L, 0L, 2L, 3L)), class = "data.frame", row.names = c(NA,
-9L))
Using the stringr package (note that this removes every occurrence of -1, not only a trailing one):
library(stringr)
df$names = str_remove_all(df$names, '-1')
names Col col2
1 bests 1 2
2 trible 2 4
3 crazy NA 5
4 cool 4 7
5 nonsense 47 9
6 Mean 294 9
7 Lose 2 0
8 Trye 1 2
9 Trified 3 3
We could use trimws from base R (the whitespace argument requires R >= 3.6.0)
df$names <- trimws(df$names, whitespace = "-\\d+")
-output
> df
names Col col2
1 bests 1 2
2 trible 2 4
3 crazy NA 5
4 cool 4 7
5 nonsense 47 9
6 Mean 294 9
7 Lose 2 0
8 Trye 1 2
9 Trified 3 3
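The answers above differ in whether they remove only a trailing "-1" or every "-1" in the string. A small base R sketch of that difference, assuming a made-up second value ("run-10-1") that is not in the original data:

```r
# "run-10-1" is an illustrative value, not part of the question's data
nm <- c("bests-1", "run-10-1")

# Anchored with $: only a "-1" at the very end of the string is removed
anchored <- sub("-1$", "", nm)

# Unanchored, literal match: every "-1" is removed, including the one inside "-10"
everywhere <- gsub("-1", "", nm, fixed = TRUE)
```

For data shaped like the question's, both give the same result; the anchored form is the safer choice if the suffix could also appear mid-string.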
I would like to know how to keep a running count of the number of times that a column in my data.frame satisfies a condition. Let's consider a data.frame such as:
x hour count
1 0 NA
2 1 NA
3 2 NA
4 3 NA
5 0 NA
6 1 NA
...
I would like to have this output:
x hour count
1 0 1
2 1 NA
3 2 NA
4 3 NA
5 0 2
6 1 NA
...
The count column should increase by 1 every time the condition hour==0 is met.
Is there a smart and efficient way to do this? Thanks
You can use seq_len on the rows where hour == 0.
i <- x$hour == 0
# one value per matching row; seq_along(i) would be as long as the whole column
x$count[i] <- seq_len(sum(i))
x
# x hour count
#1 1 0 1
#2 2 1 NA
#3 3 2 NA
#4 4 3 NA
#5 5 0 2
#6 6 1 NA
Data:
x <- structure(list(x = 1:6, hour = c(0L, 1L, 2L, 3L, 0L, 1L), count = c(NA,
NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-6L))
You can use cumsum to keep a running count of the rows where hour is 0, then replace the count with NA on the rows where hour is not 0.
library(dplyr)
df %>%
mutate(count = cumsum(hour == 0),
count = replace(count, hour != 0 , NA))
# x hour count
#1 1 0 1
#2 2 1 NA
#3 3 2 NA
#4 4 3 NA
#5 5 0 2
#6 6 1 NA
data
df <- structure(list(x = 1:6, hour = c(0L, 1L, 2L, 3L, 0L, 1L)),
class = "data.frame", row.names = c(NA, -6L))
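The same cumsum idea can be sketched in base R without dplyr. This is only an illustration under the assumption that the data has the shape shown in the question; `is_zero` is a name introduced here.

```r
# Same shape as the example data in the question
x <- data.frame(x = 1:6, hour = c(0L, 1L, 2L, 3L, 0L, 1L))

# Logical flag for the rows that satisfy the condition
is_zero <- x$hour == 0

# cumsum() gives the running count; ifelse() keeps it only where hour is 0
x$count <- ifelse(is_zero, cumsum(is_zero), NA)
```

Rows where hour is not 0 end up as NA, which matches the requested output.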
Using data.table
library(data.table)
setDT(df)[hour == 0, count := seq_len(.N)]
df
# x hour count
#1: 1 0 1
#2: 2 1 NA
#3: 3 2 NA
#4: 4 3 NA
#5: 5 0 2
#6: 6 1 NA
data
df <- structure(list(x = 1:6, hour = c(0L, 1L, 2L, 3L, 0L, 1L)),
class = "data.frame", row.names = c(NA, -6L))
I have the following dataframe in R:
Date A B C
1 2015-01-17 1 NA 1
2 2015-01-18 NA NA NA
3 2015-01-19 1 2 3
4 2015-01-19 1 NA 1
...
The goal is to sum the values in columns A, B and C across rows that share the same date:
Date A B C
1 2015-01-17 1 NA 1
2 2015-01-18 NA NA NA
3 2015-01-19 2 2 4
...
Thank you for your help.
library(dplyr)
df %>%
  group_by(Date) %>%
  summarise_at(c("A", "B", "C"),
               function(x) if (any(!is.na(x))) sum(x, na.rm = TRUE) else NA)
# A tibble: 3 x 4
Date A B C
<fct> <int> <int> <int>
1 2015-01-17 1 NA 1
2 2015-01-18 NA NA NA
3 2015-01-19 2 2 4
data:
df <- structure(list(Date = structure(c(1L, 2L, 3L, 3L), .Label = c("2015-01-17",
"2015-01-18", "2015-01-19"), class = "factor"), A = c(1L, NA,
1L, 1L), B = c(NA, NA, 2L, NA), C = c(1L, NA, 3L, 1L)), class = "data.frame", row.names = c("1",
"2", "3", "4"))
Another option is sum_ from hablar
library(hablar)
library(dplyr)
df %>%
group_by(Date) %>%
summarise_if(is.numeric, sum_)
# A tibble: 3 x 4
# Date A B C
# <fct> <int> <int> <int>
#1 2015-01-17 1 NA 1
#2 2015-01-18 NA NA NA
#3 2015-01-19 2 2 4
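For comparison, a base R sketch of the same "sum, but stay NA when every value is NA" rule. The helper name `sum_or_na` is made up for this example, and the data frame is rebuilt inline (with Date as character rather than factor) so the block runs on its own.

```r
df <- data.frame(
  Date = c("2015-01-17", "2015-01-18", "2015-01-19", "2015-01-19"),
  A = c(1L, NA, 1L, 1L),
  B = c(NA, NA, 2L, NA),
  C = c(1L, NA, 3L, 1L)
)

# Hypothetical helper: NA if the whole group is NA, otherwise the sum of the non-NA values
sum_or_na <- function(v) if (all(is.na(v))) NA_integer_ else sum(v, na.rm = TRUE)

# One row per Date, with the helper applied to each of A, B and C
res <- aggregate(df[c("A", "B", "C")], by = list(Date = df$Date), FUN = sum_or_na)
```

A plain `sum(v, na.rm = TRUE)` would return 0 for an all-NA group, which is why the helper checks for that case first.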
data
df <- structure(list(Date = structure(c(1L, 2L, 3L, 3L), .Label = c("2015-01-17",
"2015-01-18", "2015-01-19"), class = "factor"), A = c(1L, NA,
1L, 1L), B = c(NA, NA, 2L, NA), C = c(1L, NA, 3L, 1L)),
class = "data.frame", row.names = c("1",
"2", "3", "4"))
I have the following data table
PIECE SAMPLE QC_CODE
1 1 1
2 1 NA
3 2 2
4 2 4
5 2 NA
6 3 6
7 3 3
8 3 NA
9 4 6
10 4 NA
and I would like to count the number of qc_code in each sample and return an output like this
SAMPLE SAMPLE_SIZE QC_CODE_COUNT
1 2 1
2 3 2
3 3 2
4 2 1
Where sample size is the count of pieces in each sample, and qc_code_count is the count of all qc_code values that are not NA.
How would I go about this in R?
You can try
library(dplyr)
df1 %>%
group_by(SAMPLE) %>%
summarise(SAMPLE_SIZE=n(), QC_CODE_UNIT= sum(!is.na(QC_CODE)))
# SAMPLE SAMPLE_SIZE QC_CODE_UNIT
#1 1 2 1
#2 2 3 2
#3 3 3 2
#4 4 2 1
Or
library(data.table)
setDT(df1)[,list(SAMPLE_SIZE=.N, QC_CODE_UNIT=sum(!is.na(QC_CODE))), by=SAMPLE]
Or using aggregate from base R
do.call(data.frame,aggregate(QC_CODE~SAMPLE, df1, na.action=NULL,
FUN=function(x) c(SAMPLE_SIZE=length(x), QC_CODE_UNIT= sum(!is.na(x)))))
data
df1 <- structure(list(PIECE = 1:10, SAMPLE = c(1L, 1L, 2L, 2L, 2L, 3L,
3L, 3L, 4L, 4L), QC_CODE = c(1L, NA, 2L, 4L, NA, 6L, 3L, NA,
6L, NA)), .Names = c("PIECE", "SAMPLE", "QC_CODE"), class = "data.frame",
row.names = c(NA, -10L))
Say I have a data table. I can create a column of lagged values:
>x
date id val valPr
1/4/14 a 1 2
1/3/14 a 2 3
1/2/14 a 3 4
1/1/14 a 4 NA
1/4/14 b 10 20
1/3/14 b 20 30
1/2/14 b 30 40
1/1/14 b 40 NA
Using:
setDT(x)[, valPr := c(val[-1], NA), by = "id"]
Is there a way to do something similar to lag by more than one period? Three for example?
It would produce something like this:
>x
date id val valPr
1/4/14 a 1 4
1/3/14 a 2 NA
1/2/14 a 3 NA
1/1/14 a 4 NA
1/4/14 b 10 40
1/3/14 b 20 NA
1/2/14 b 30 NA
1/1/14 b 40 NA
You could alternatively do the following. lead is a function in dplyr, so both packages need to be loaded.
library(data.table)
library(dplyr)
setDT(mydf)[, valPr2 := lead(val, 3), by = "id"]
# date id val valPr valPr2
#1: 1/4/14 a 1 2 4
#2: 1/3/14 a 2 3 NA
#3: 1/2/14 a 3 4 NA
#4: 1/1/14 a 4 NA NA
#5: 1/4/14 b 10 20 40
#6: 1/3/14 b 20 30 NA
#7: 1/2/14 b 30 40 NA
#8: 1/1/14 b 40 NA NA
DATA
mydf <- structure(list(date = structure(c(4L, 3L, 2L, 1L, 4L, 3L, 2L,
1L), .Label = c("1/1/14", "1/2/14", "1/3/14", "1/4/14"), class = "factor"),
id = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("a",
"b"), class = "factor"), val = c(1L, 2L, 3L, 4L, 10L, 20L,
30L, 40L), valPr = c(2L, 3L, 4L, NA, 20L, 30L, 40L, NA)), .Names = c("date",
"id", "val", "valPr"), class = "data.frame", row.names = c(NA,
-8L))
With data.table, you would do it like this:
nlags = 3
x[, valPr := c(val[-seq_len(nlags)], rep(NA, nlags)), by = "id"]
This drops the first nlags values of val within each id and pads the end with the same number of NA values, so each row receives the value from nlags rows further down. It assumes every id has at least nlags rows. To shift in the other direction, put the NA values first and drop the last nlags values instead.
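The drop-and-pad step can also be sketched in base R, here with a hypothetical helper `lead_n` applied per id through `ave()`. The helper name and the guard for groups shorter than n are additions made for this example; data.table also provides `shift()`, which this sketch does not use.

```r
# Hypothetical helper: shift v up by n positions and pad the end with NA
lead_n <- function(v, n) {
  k <- min(n, length(v))                  # guard: groups shorter than n become all NA
  keep <- seq_len(length(v) - k) + k      # positions that remain after dropping the first k
  c(v[keep], rep(NA, k))
}

# Same values as the example in the question
x <- data.frame(id = rep(c("a", "b"), each = 4),
                val = c(1L, 2L, 3L, 4L, 10L, 20L, 30L, 40L))

# Apply the helper within each id
x$valPr <- ave(x$val, x$id, FUN = function(v) lead_n(v, 3))
```

With three periods, only the first row of each id has a value to pull forward, so the remaining rows are NA.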