I have a dataset with three columns. The third column has date values mixed with some strings.
ID Col1 Value
123 Start.Date 2011-06-18
123 Stem A1
123 Stem_1 A6
123 Stem_2 NA
321 Start.Date 2014-08-05
321 Stem C1
321 Stem_1 C4
321 Stem_2 NA
677 Start.Date NA
677 Stem NA
677 Stem_1 NA
677 Stem_2 NA
How can I separate out the dates and store them in a different column, like this?
ID Col1 Value Start.Date
123 Stem A1 2011-06-18
123 Stem_1 A6 2011-06-18
123 Stem_2 NA 2011-06-18
321 Stem C1 2014-08-05
321 Stem_1 C4 2014-08-05
321 Stem_2 NA 2014-08-05
677 Stem NA NA
677 Stem_1 NA NA
677 Stem_2 NA NA
Thanks.
An alternative solution based solely on tidyr:
library(tidyr)
# name id_cols explicitly rather than passing ID by position
df %>% pivot_wider(id_cols = ID, names_from = Col1, values_from = Value) %>%
pivot_longer(c("Stem", "Stem_1", "Stem_2"), names_to = "Col1", values_to = "Value")
Create a new column that holds the value from the Value column where Col1 == 'Start.Date' and NA otherwise. Then, within each ID, fill the NA values downward from the preceding date and remove the 'Start.Date' rows.
library(dplyr)
library(tidyr)
df %>%
mutate(Start.Date = as.Date(replace(Value, Col1 != 'Start.Date', NA))) %>%
group_by(ID) %>%
fill(Start.Date) %>%
ungroup() %>%
filter(Col1 != 'Start.Date')
# ID Col1 Value Start.Date
# <int> <chr> <chr> <date>
#1 123 Stem A1 2011-06-18
#2 123 Stem_1 A6 2011-06-18
#3 123 Stem_2 NA 2011-06-18
#4 321 Stem C1 2014-08-05
#5 321 Stem_1 C4 2014-08-05
#6 321 Stem_2 NA 2014-08-05
#7 677 Stem NA NA
#8 677 Stem_1 NA NA
#9 677 Stem_2 NA NA
data
df <- structure(list(ID = c(123L, 123L, 123L, 123L, 321L, 321L, 321L,
321L, 677L, 677L, 677L, 677L), Col1 = c("Start.Date", "Stem",
"Stem_1", "Stem_2", "Start.Date", "Stem", "Stem_1", "Stem_2",
"Start.Date", "Stem", "Stem_1", "Stem_2"), Value = c("2011-06-18",
"A1", "A6", NA, "2014-08-05", "C1", "C4", NA, NA, NA, NA, NA)),
class = "data.frame", row.names = c(NA, -12L))
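For readers who prefer to avoid extra packages, the same idea can be sketched in base R: look up each ID's 'Start.Date' row with match() and attach it to the remaining rows. This is only a sketch; `dat`, `is_start`, `starts`, and `res` are illustrative names, and the small data frame is a trimmed copy of the example data.

```r
# Trimmed copy of the example data (illustrative only)
dat <- data.frame(
  ID    = c(123L, 123L, 321L, 321L, 677L, 677L),
  Col1  = c("Start.Date", "Stem", "Start.Date", "Stem", "Start.Date", "Stem"),
  Value = c("2011-06-18", "A1", "2014-08-05", "C1", NA, NA),
  stringsAsFactors = FALSE
)

# Rows that carry the start date for their ID
is_start <- dat$Col1 == "Start.Date"
starts   <- dat[is_start, c("ID", "Value")]

# Keep the other rows and look up the start date of each ID
res <- dat[!is_start, ]
res$Start.Date <- as.Date(starts$Value[match(res$ID, starts$ID)])
rownames(res) <- NULL
res
```

Unlike the fill() approach, this does not depend on the 'Start.Date' row coming first within each ID, but it does assume at most one such row per ID.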
I have a dataframe in the wide format such as below:
Subject Volume.1 Volume.2 Volume.3 Volume.4
1 77 22 1 NA
2 65 182 NA NA
3 98 NA NA NA
4 66 76 145 677
I want to select the Volume.1 column together with the largest of the remaining volume columns, irrespective of which column it came from, but I am struggling to code this correctly. Some of the columns are NA when a subject does not have a recording at that point.
For instance, with the above example the table would look like:
Subject Volume.1 Worst volume
1 77 22
2 65 182
3 98 NA
4 66 677
I was wondering if anyone could help?
We may use pmax to take the row-wise maximum of Volume.2 to Volume.4, ignoring NA values:
cbind(df[1:2], WorseVolume = do.call(pmax, c(df[3:5], na.rm = TRUE)))
-output
Subject Volume.1 WorseVolume
1 1 77 22
2 2 65 182
3 3 98 NA
4 4 66 677
data
df <- structure(list(Subject = 1:4, Volume.1 = c(77L, 65L, 98L, 66L
), Volume.2 = c(22L, 182L, NA, 76L), Volume.3 = c(1L, NA, NA,
145L), Volume.4 = c(NA, NA, NA, 677L)), class = "data.frame", row.names = c(NA,
-4L))
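To make the pmax step a little more concrete, here is a small sketch of how it behaves with na.rm = TRUE, followed by a variant that picks the later volume columns by name instead of by position. The names `a`, `b`, `elementwise`, `dat`, `later_cols`, and `worst` are made up for illustration, and selecting by the "Volume." prefix is an assumption about how the real columns are named.

```r
# pmax compares vectors element by element (a "parallel" maximum)
a <- c(1, NA, 3)
b <- c(2, NA, NA)
elementwise <- pmax(a, b, na.rm = TRUE)
elementwise  # an element is NA only where every input is NA

# Hypothetical variant: select every Volume.* column except Volume.1 by name
dat <- data.frame(
  Subject  = 1:2,
  Volume.1 = c(77L, 98L),
  Volume.2 = c(22L, NA),
  Volume.3 = c(1L, NA)
)
later_cols <- setdiff(grep("^Volume\\.", names(dat), value = TRUE), "Volume.1")
worst <- do.call(pmax, c(dat[later_cols], na.rm = TRUE))
worst
```

Selecting by name keeps the call working if the number of volume columns changes between datasets.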
I need to delete some rows in my dataset based on a given condition.
Kindly go through the sample data for reference.
ID Date Dur
123 01/05/2000 3
123 08/04/2002 6
564 04/04/2012 2
741 01/08/2011 5
789 02/03/2009 1
789 08/01/2010 NA
789 05/05/2011 NA
852 06/06/2015 3
852 03/02/2016 NA
155 03/02/2008 NA
155 01/01/2009 NA
159 07/07/2008 NA
My main concern is the Dur column. I have to delete the rows where Dur is not NA for IDs that have more than one row,
i.e. IDs (123, 789, 852) have more than one record/row and include a Dur value, so I need to remove the rows with a Dur value, which means the entire ID 123 and the first record of 789 and 852.
I don't want to delete IDs (564, 741, 852) that have a Dur value in a single record, or any other IDs with NA in Dur.
Expected Output:
ID Date Dur
564 04/04/2012 2
741 01/08/2011 5
789 08/01/2010 NA
789 05/05/2011 NA
852 03/02/2016 NA
155 03/02/2008 NA
155 01/01/2009 NA
159 07/07/2008 NA
Kindly suggest a code to solve the issue.
Thanks in Advance!
One way would be to keep the rows where the group has only one row or where Dur is NA.
This can be written in dplyr as:
library(dplyr)
df %>% group_by(ID) %>% filter(n() == 1 | is.na(Dur))
# ID Date Dur
# <int> <chr> <int>
#1 564 04/04/2012 2
#2 741 01/08/2011 5
#3 789 08/01/2010 NA
#4 789 05/05/2011 NA
#5 852 03/02/2016 NA
#6 155 03/02/2008 NA
#7 155 01/01/2009 NA
#8 159 07/07/2008 NA
Using data.table :
library(data.table)
setDT(df)[, .SD[.N == 1 | is.na(Dur)], ID]
and base R :
subset(df, ave(is.na(Dur), ID, FUN = function(x) length(x) == 1 | x))
data
df <- structure(list(ID = c(123L, 123L, 564L, 741L, 789L, 789L, 789L,
852L, 852L, 155L, 155L, 159L), Date = c("01/05/2000", "08/04/2002",
"04/04/2012", "01/08/2011", "02/03/2009", "08/01/2010", "05/05/2011",
"06/06/2015", "03/02/2016", "03/02/2008", "01/01/2009", "07/07/2008"
), Dur = c(3L, 6L, 2L, 5L, 1L, NA, NA, 3L, NA, NA, NA, NA)),
class = "data.frame", row.names = c(NA, -12L))
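The rule used above (keep a row if its ID occurs only once, or if its Dur is NA) can also be spelled out step by step in base R, which may be easier to read than the compact ave() call. This is a sketch on a shortened copy of the data; `dat`, `group_size`, `keep`, and `res` are illustrative names.

```r
# Shortened copy of the example data (illustrative only)
dat <- data.frame(
  ID  = c(123L, 123L, 564L, 789L, 789L, 159L),
  Dur = c(3L, 6L, 2L, 1L, NA, NA)
)

# Number of rows in each ID group, repeated for every row of that group
group_size <- ave(seq_along(dat$ID), dat$ID, FUN = length)

# Keep single-row groups as they are; otherwise keep only rows with NA in Dur
keep <- group_size == 1 | is.na(dat$Dur)
res  <- dat[keep, ]
res
```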
We can use .I in data.table to get the row indices to keep and then subset by them:
library(data.table)
setDT(df)[df[, .I[.N == 1 | is.na(Dur)], ID]$V1]
I have a DF that lists IDs by date like this:
Date Ben James
12/10/17 1294 NA
12/11/17 NA 4523
12/12/17 8959 3246
12/13/17 2345 NA
12/14/17 NA NA
12/15/17 0303 8877
12/16/17 NA 1427
The number of "name" columns is variable, so on another day I might have a DF that looks like this:
Date Ben James Alex
12/10/17 1294 NA 3754
12/11/17 NA 4523 1122
12/12/17 8959 3246 5582
12/13/17 2345 NA NA
12/14/17 NA NA 0094
12/15/17 0303 8877 NA
12/16/17 NA 1427 NA
I want to put the 3 most recent IDs for each name column into a new dataframe, like this:
IDs
8959
2345
0303
3246
8877
1427
1122
5582
0094
I just need the IDs in the new DF. I don't care about labeling them by name or date.
c(sapply(df[-1], function(x) sprintf("%04d", tail(x[!is.na(x)], 3))))
#[1] "8959" "2345" "0303" "3246" "8877" "1427" "1122" "5582" "0094"
DATA
df = structure(list(Date = c("12/10/17", "12/11/17", "12/12/17", "12/13/17",
"12/14/17", "12/15/17", "12/16/17"), Ben = c(1294L, NA, 8959L,
2345L, NA, 303L, NA), James = c(NA, 4523L, 3246L, NA, NA, 8877L,
1427L), Alex = c(3754L, 1122L, 5582L, NA, 94L, NA, NA)), .Names = c("Date",
"Ben", "James", "Alex"), class = "data.frame", row.names = c(NA,
-7L))
res <- do.call(rbind,
apply(df[, -1], 2, function(x) data.frame(IDs = tail(na.omit(x), 3))))
Here is an option using tidyverse
library(dplyr)
library(stringr)
df %>%
# across() needs dplyr >= 1.0; it replaces the deprecated summarise_at()/funs()
summarise(across(-Date, ~ list(tail(.x[!is.na(.x)], 3)))) %>%
unlist(use.names = FALSE) %>%
str_pad(width = 4, pad = "0")
#[1] "8959" "2345" "0303" "3246" "8877" "1427" "1122" "5582" "0094"
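All of the answers above take the last three non-NA values in row order, so they rely on the rows already being sorted by date. If that is not guaranteed, a sketch like the following sorts first. It assumes the dates are in month/day/two-digit-year form, as in the example, and `dat`, `ids`, and `ids_chr` are illustrative names on made-up, deliberately unsorted data.

```r
# Made-up data with rows deliberately out of date order
dat <- data.frame(
  Date  = c("12/12/17", "12/10/17", "12/11/17", "12/13/17", "12/14/17"),
  Ben   = c(8959L, 1294L, 7777L, 303L, NA),
  James = c(3246L, NA, 4523L, NA, 1427L),
  stringsAsFactors = FALSE
)

# Sort by the parsed date so that "last" really means "most recent"
dat <- dat[order(as.Date(dat$Date, format = "%m/%d/%y")), ]

# Last three non-NA values of every column except Date, however many there are
ids <- unlist(lapply(dat[-1], function(x) tail(x[!is.na(x)], 3)), use.names = FALSE)

# Restore the leading zeros that integer storage drops
ids_chr <- sprintf("%04d", ids)
ids_chr
```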
I have a data frame consisting of character variables which looks like this:
V1 V2 V3 V4 V5
1 ID Date pic1 pic2 pic3
2 1 15.06.16 11:50 abc <NA> def
3 1 16.06.16 11:19 <NA> hij <NA>
4 1 17.06.16 11:41 <NA> <NA> nop
5 2 28.05.16 11:40 tuv <NA> <NA>
6 2 29.05.16 11:39 <NA> zab <NA>
7 2 30.05.16 09:07 <NA> <NA> wxy
8 3 03.06.16 07:31 lmn <NA> <NA>
9 3 04.06.16 11:01 <NA> rst <NA>
10 3 05.06.16 13:57 <NA> <NA> opq
So on each day one of the pic-variables contains a value, the rest is NA.
Now I want to combine all pic-values into one variable by replacing the NAs. Sorry if this is a duplicate; I've already tried a lot of suggested solutions but nothing has worked so far.
Thanks!
We can try with data.table. Convert the 'data.frame' to a 'data.table' (setDT(df1)); then, grouped by 'ID' and 'Date', unlist the Subset of Data.table (.SD) and omit the NA elements (na.omit).
library(data.table)
setDT(df1)[, .(pic = na.omit(unlist(.SD))), by = .(ID, Date)]
# ID Date pic
# 1: 1 15.06.16 11:50 abc
# 2: 1 15.06.16 11:50 def
# 3: 1 16.06.16 11:19 hij
# 4: 1 17.06.16 11:41 nop
# 5: 2 28.05.16 11:40 tuv
# 6: 2 29.05.16 11:39 zab
# 7: 2 30.05.16 09:07 wxy
# 8: 3 03.06.16 07:31 lmn
# 9: 3 04.06.16 11:01 rst
#10: 3 05.06.16 13:57 opq
Or another option is pmax if there is only a single non-NA per row
setDT(df1)[, pic := do.call(pmax, c(.SD, na.rm = TRUE)),
.SDcols = pic1:pic3][, paste0("pic", 1:3) := NULL][]
Or using dplyr
library(dplyr)
df1 %>%
mutate(pic = pmax(pic1, pic2, pic3, na.rm=TRUE))%>%
select(-(pic1:pic3))
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), Date = c("15.06.16 11:50",
"16.06.16 11:19", "17.06.16 11:41", "28.05.16 11:40", "29.05.16 11:39",
"30.05.16 09:07", "03.06.16 07:31", "04.06.16 11:01", "05.06.16 13:57"
), pic1 = c("abc", NA, NA, "tuv", NA, NA, "lmn", NA, NA), pic2 = c(NA,
"hij", NA, NA, "zab", NA, NA, "rst", NA), pic3 = c("def", NA,
"nop", NA, NA, "wxy", NA, NA, "opq")), .Names = c("ID", "Date",
"pic1", "pic2", "pic3"), row.names = c(NA, -9L), class = "data.frame")
Assuming
on each day one of the pic-variables contains a value, the rest is NA
You can use coalesce from dplyr to get what you want:
library(dplyr)
result <- df1 %>% mutate(pic = coalesce(pic1, pic2, pic3)) %>%
select(-(pic1:pic3))
With the data supplied by akrun:
print(result)
## ID Date pic
##1 1 15.06.16 11:50 abc
##2 1 16.06.16 11:19 hij
##3 1 17.06.16 11:41 nop
##4 2 28.05.16 11:40 tuv
##5 2 29.05.16 11:39 zab
##6 2 30.05.16 09:07 wxy
##7 3 03.06.16 07:31 lmn
##8 3 04.06.16 11:01 rst
##9 3 05.06.16 13:57 opq
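If dplyr is not available, the "first non-NA value per row" idea behind coalesce can be approximated in base R with Reduce and ifelse. This is a sketch rather than a drop-in replacement; `dat` and `pic_cols` are illustrative names, and matching columns by the "pic" prefix is an assumption about the column naming.

```r
# Small made-up example with one non-NA pic value per row
dat <- data.frame(
  ID   = c(1L, 1L, 2L),
  pic1 = c("abc", NA, NA),
  pic2 = c(NA, "hij", NA),
  pic3 = c(NA, NA, "wxy"),
  stringsAsFactors = FALSE
)

# Columns to combine, found by name so the count can vary
pic_cols <- grep("^pic", names(dat), value = TRUE)

# Walk through the columns left to right, filling only the still-missing entries
dat$pic <- Reduce(function(a, b) ifelse(is.na(a), b, a), dat[pic_cols])

# Drop the original pic columns
dat <- dat[setdiff(names(dat), pic_cols)]
dat
```

As with coalesce, when a row has more than one non-NA value, the leftmost one wins.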
Date_Time C 4700C Put.15 4800C Put.16 4900C Put.17
1 20120531 NA NA NA NA NA NA NA
2 20120601 1445 4800 208 84.9 143.3 119.8 92 167
3 20120606 1100 4900 268.85 43 192 66.3 127 100
4 20120607 1500 5000 345 24 261 38.25 183 60.5
5 20120612 1515 NA NA NA NA NA NA NA
I have the above sample data frame. Here I want to look up each row's value in the C column among the column names and get back the value of the matching column for that row as the result.
For example, I want to search for the value of the C column in the 2nd row, which is 4900, among all the column names, and once 4900C is found, get the value in 4900C for that row.
Please help.
It would have been better if the delimiters in the data were clear. For example "Date_Time" as column name could have elements "20120531 NA" as a string.
We find the columns whose names are digits followed by 'C' with grep ('j2'), remove the non-numeric substring from those names using sub, and match the result with the 'C' column to get the column index ('j1'). We then build a logical index of the rows that found a match ('i1') and, with a row/column index matrix, extract the elements from the candidate columns (df1[j2]) and assign them to 'NewCol'.
j2 <- grep("\\d+C", names(df1))
j1 <- match(df1$C, sub("\\D+", "", names(df1)[j2]))
i1 <- !is.na(j1)
df1$NewCol[i1] <- df1[j2][cbind((1:nrow(df1))[i1], j1[i1])]
df1
# Date Time C 4700C Put.15 4800C Put.16 4900C Put.17 NewCol
#1 20120531 NA NA NA NA NA NA NA NA NA
#2 20120601 1445 4800 208.00 84.9 143.3 119.80 92 167.0 143.3
#3 20120606 1100 4900 268.85 43.0 192.0 66.30 127 100.0 127.0
#4 20120607 1500 5000 345.00 24.0 261.0 38.25 183 60.5 NA
#5 20120612 1515 NA NA NA NA NA NA NA NA
NOTE: Here I am assuming that 'Time' is the second column
data
df1 <- structure(list(Date = c(20120531L, 20120601L, 20120606L, 20120607L,
20120612L), Time = c(NA, 1445L, 1100L, 1500L, 1515L), C = c(NA,
4800L, 4900L, 5000L, NA), `4700C` = c(NA, 208, 268.85, 345, NA
), Put.15 = c(NA, 84.9, 43, 24, NA), `4800C` = c(NA, 143.3, 192,
261, NA), Put.16 = c(NA, 119.8, 66.3, 38.25, NA), `4900C` = c(NA,
92L, 127L, 183L, NA), Put.17 = c(NA, 167, 100, 60.5, NA)),
.Names = c("Date",
"Time", "C", "4700C", "Put.15", "4800C", "Put.16", "4900C", "Put.17"
), class = "data.frame", row.names = c("1", "2", "3", "4", "5"))
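The key trick in the answer is indexing with a two-column matrix of (row, column) pairs, which picks one cell per row. Here is a smaller sketch of the same lookup that builds the column name from C directly instead of stripping the names. `quotes`, `strike_cols`, `col_idx`, and `ok` are illustrative names, and the data is a cut-down version of the example.

```r
# Cut-down version of the example; check.names = FALSE keeps names like "4800C"
quotes <- data.frame(
  C       = c(NA, 4800L, 4900L, 5000L),
  `4800C` = c(NA, 143.3, 192, 261),
  `4900C` = c(NA, 92, 127, 183),
  check.names = FALSE
)

# Columns named "<digits>C"
strike_cols <- grep("^\\d+C$", names(quotes), value = TRUE)

# For each row, the position of the column named "<C value>C" (NA if none)
col_idx <- match(paste0(quotes$C, "C"), strike_cols)
ok <- !is.na(col_idx)

# A two-column (row, column) matrix extracts one cell per matching row
quotes$NewCol <- NA_real_
quotes$NewCol[ok] <- as.matrix(quotes[strike_cols])[cbind(which(ok), col_idx[ok])]
quotes
```

Rows whose C value is NA or has no matching column (5000 here) are left as NA in NewCol.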