Extracting nth value from row vector in R

I have been searching/thinking of a way in which I can extract the nth value (e.g. 2nd, 5th, 7th, etc.) from each row in my data frame.
For example, I have the following columns:
ID Q1-2013 Q2-2013 Q3-2013 Q4-2013 Q1-2014 Q2-2014 Q3-2014 Q4-2014
Under each column there are values. What I would like to do is pull the nth value of each row from the quarter columns (columns 2-9). For example, if I am looking for the 2nd value in each row, the function would extract the 2nd value of each row from columns 2-9 (Q1-2013 to Q4-2014). In addition, the function should ignore blanks/NA values in each row.

Maybe this is what you're after.
I first modified the iris data set with some NAs in each column:
iris[] <- lapply(iris, function(x){ x[sample(150, 30, F)] <- NA; x})
head(iris)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 5.1 3.5 1.4 NA setosa
#2 NA NA 1.4 NA setosa
#3 NA NA 1.3 0.2 setosa
#4 4.6 3.1 1.5 NA setosa
#5 5.0 3.6 1.4 0.2 setosa
#6 5.4 NA 1.7 0.4 setosa
Then, to extract the second non-empty, non-NA entry per row you could use apply (I know it's not recommended on data frames, but it does the dirty job):
apply(iris, 1, function(x) x[which(!is.na(x) & x != "")[2]])
# [1] "3.5" "setosa" "0.2" "3.1" "3.6" "1.7" "3.4" "3.4" "2.9" "3.1" "setosa"
#[12] "3.4" "1.4" "1.1" "1.2" "4.4" "3.9" "3.5" "3.8" "3.8" "0.2" "3.7"
#[23] "3.6" "1.7" "1.9" "3.0" "3.4" "1.5" "3.4" "3.2" "3.1" "3.4" "4.1"
#[34] "4.2" "3.1" "3.2" "3.5" "3.6" "setosa" "1.5" "1.3" "2.3" "1.3" "0.6"
#[45] "0.4" "3.0" "3.8" "3.2" "3.7" "3.3" "3.2" "3.2" "1.5" "2.3" "2.8"
#[56] "2.8" "3.3" "2.4" "4.6" "1.4" "2.0" "3.0" "1.0" "2.9" "2.9" "3.1"
#[67] "3.0" "2.7" "4.5" "3.9" "3.2" "4.0" "2.5" "4.7" "4.3" "3.0" "2.8"
#[78] "5.0" "2.9" "3.5" "3.8" "2.4" "2.7" "2.7" "3.0" "3.4" "3.1" "1.3"
#[89] "4.1" "1.3" "2.6" "3.0" "2.6" "2.3" "4.2" "3.0" "2.9" "2.9" "2.5"
#[100] "2.8" "3.3" "2.7" "3.0" "2.9" "3.0" "3.0" "4.5" "2.9" "5.8" "3.6"
#[111] "3.2" "1.9" "5.5" "2.0" "5.1" "3.2" "5.5" "3.8" "virginica" "1.5" "3.2"
#[122] "2.8" "2.8" "2.7" "2.1" "6.0" "2.8" "3.0" "2.8" "5.8" "2.8" "3.8"
#[133] "5.6" "1.5" "2.6" "3.0" "5.6" "5.5" "4.8" "3.1" "5.6" "5.1" "2.7"
#[144] "3.2" "3.3" "3.0" "2.5" "5.2" "5.4" "3.0"
Because apply will first convert the data frame to a matrix, all columns are converted to the same type, which in this case is character. You can later convert the result to whatever you want (but note that you can't directly convert the output vector back to numeric here, since it contains character strings such as "setosa").
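If the columns of interest are all numeric, as the quarter columns in the question appear to be, you can sidestep the character coercion by applying over only those columns. A minimal sketch with made-up data (the column names and values are assumptions, not the asker's):

```r
# Hypothetical data frame shaped like the question's (ID plus quarter columns)
df <- data.frame(ID = 1:3,
                 Q1.2013 = c(NA, 10, 20),
                 Q2.2013 = c(5, NA, 21),
                 Q3.2013 = c(6, 11, NA),
                 Q4.2013 = c(7, 12, 22))

n <- 2  # which non-NA value to pull from each row

# Drop the ID column so apply builds a numeric matrix and the result stays numeric
nth <- unname(apply(df[, -1], 1, function(x) x[which(!is.na(x))[n]]))
nth
# [1]  6 11 21
```

Because every column passed to apply is numeric, the intermediate matrix is numeric and no conversion back from character is needed.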

You could also use a convenient function naLast from library(SOfun)
library(SOfun)
dat[dat==''] <- NA #convert all `blank` cells to `NA`
n <- 2 # the row/column index that needs to be extracted
naLast(dat, by='col')[n,] #get the 2nd non-empty/non-NA element for each column
#V1 V2 V3 V4 V5
#"G" "B" "B" "B" "C"
which is equivalent to the following apply call:
apply(dat, 2, function(x) x[which(!is.na(x) & x!='')[2]])
#V1 V2 V3 V4 V5
#"G" "B" "B" "B" "C"
You could also specify by='row'
naLast(dat, by='row')[,n] #get the 2nd non-empty/nonNA element for each row
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
#"G" "D" "B" "G" "E" "B" "J" "F" "F" "A" "H" "C" "A" "D" "H" "D" "J" "C" "A" "A"
data
set.seed(25)
dat <- as.data.frame(matrix(sample(c(NA, '', LETTERS[1:10]), 20*5, replace=TRUE),
                            ncol=5), stringsAsFactors=FALSE)
You can install the package by
library(devtools)
install_github("mrdwab/SOfun")

anova() does not work properly with lme objects after updating - am I missing something?

I had code that worked fine so far. I want to test things with gls, lme and gamm (from packages nlme and mgcv), and I compared different models with anova(). However, I needed another package that did not work with my R version (which was almost a year old), so I updated R (via the updater package) and RStudio.
The issue now is that anova() gives no output after running, or only "Denom. DF: 91" and nothing else.
I have tried different things and searched a lot, but I found no current thread dealing with such a problem, and the help files suggest it should work the way I am using it. Thus, I suspect I am missing something essential (probably even obvious), but I don't see it. I hope you can tell me where I am going wrong.
Here is some data to play with (copied from txt-file):
"treat" "x" "time" "nest"
"1" "1" 49.37 1 "K1"
"2" "1" 48.68 1 "K2"
"3" "2" 44.7 1 "T7"
"4" "2" 49.3 1 "T8"
"5" "1" 48.78 1 "K3"
"6" "2" 42.37 1 "T10"
"7" "1" 39.26 1 "K4"
"8" "2" 46.36 1 "T11"
"9" "1" 40.36 1 "K5"
"10" "2" 47.14 1 "T9"
"11" "1" 48.81 1 "K6"
"12" "1" 40.4 1 "K10"
"13" "2" 53.42 1 "T4"
"14" "2" 46.85 1 "T5"
"15" "2" 44.58 1 "T2"
"16" "2" 47.51 1 "T6"
"17" "1" 51.7 1 "K8"
"18" "1" 48.16 1 "K7"
"19" "2" 48.86 1 "T3"
"20" "1" 44.6 1 "K11"
"21" "1" 49.71 1 "K9"
"22" "2" 44.54 1 "T1"
"23" "2" 41.55 2 "T3"
"24" "1" 32.55 2 "K3"
"25" "1" 42.15 2 "K1"
"26" "2" 51.06 2 "T1"
"27" "1" 38.43 2 "K11"
"28" "2" 39.91 2 "T11"
"29" "1" 36.73 2 "K7"
"30" "2" 50.19 2 "T4"
"31" "1" 42.26 2 "K8"
"32" "1" 43.02 2 "K6"
"33" "2" 37.6 2 "T10"
"34" "1" 33.42 2 "K4"
"35" "2" 39.64 2 "T5"
"36" "2" 43.56 2 "T2"
"37" "2" 35.31 2 "T7"
"38" "2" 37 2 "T8"
"39" "2" 40.87 2 "T6"
"40" "1" 35.29 2 "K9"
"41" "2" 41.83 2 "T9"
"42" "1" 37.88 2 "K10"
"43" "1" 36.5 2 "K5"
"44" "1" 34.21 3 "K4"
"45" "1" 38.04 3 "K6"
"46" "1" 35.14 3 "K3"
"47" "2" 38.18 3 "T10"
"48" "1" 40.26 3 "K11"
"49" "2" 37.09 3 "T3"
"50" "2" 43.1 3 "T11"
"51" "2" 34.26 3 "T7"
"52" "1" 36.58 3 "K9"
"53" "1" 35.81 3 "K2"
"54" "1" 39.83 3 "K10"
"55" "2" 37.65 3 "T6"
"56" "1" 39.8 3 "K7"
"57" "1" 36.41 3 "K8"
"58" "1" 35.22 3 "K5"
"59" "2" 39.68 3 "T8"
"60" "2" 41.12 3 "T1"
"61" "2" 36.93 3 "T9"
"62" "1" 35.66 3 "K1"
"63" "2" 36.91 3 "T4"
"64" "2" 38.84 3 "T5"
"65" "2" 34.31 3 "T2"
"66" "1" 32.71 4 "K9"
"67" "2" 37.84 4 "T11"
"68" "1" 28.01 4 "K10"
"69" "2" 39.69 5 "T11"
"70" "2" 35.08 4 "T10"
"71" "2" 34.43 4 "T9"
"72" "1" 32.12 4 "T8"
"73" "2" 30.41 4 "T7"
"74" "1" 31.81 4 "K7"
"75" "2" 36.41 4 "T6"
"76" "1" 29.17 5 "K6"
"77" "1" 28.59 4 "K6"
"78" "2" 33.99 4 "T5"
"79" "1" 30.41 4 "K5"
"80" "1" 29.8 4 "K4"
"81" "2" 34.72 4 "T4"
"82" "2" 34.38 4 "T3"
"83" "1" 28.12 4 "K3"
"84" "2" 34.62 4 "T2"
"85" "1" 31.88 4 "K2"
"86" "1" 29.35 4 "K1"
"87" "2" 37.95 4 "T1"
"88" "2" 40.85 5 "T4"
"89" "2" 35.07 5 "T5"
"90" "2" 36.15 5 "T8"
"91" "2" 36.48 5 "T10"
"92" "1" 33.73 4 "K8"
"93" "1" 28.17 5 "K9"
"94" "1" 32.81 5 "K10"
"95" "1" 32.17 4 "K11"
And this is basically one of the models I try to run:
test <- read.table(file="C:/Users/marvi_000/Desktop/testdata.txt")
str(test)
test$treat <- as.factor(test$treat)
test$nest <- as.factor(test$nest)
library(nlme)
m.test <- gls(x ~ treat * time,
correlation = corAR1(form =~ time | nest),
test, na.action = na.omit)
anova(m.test)
the output is:
Denom. DF: 91
When comparing models with anova(m1, m2) nothing happens at all.
The same is true when I run a gamm from package mgcv and using anova(m$lme) or anova(m1$lme, m2$lme).
I would appreciate any help or hint, pointing me towards the right direction. Thanks a lot!
EDIT:
After some discussion, I found out that it is a problem with the scripts. I'm using RStudio and RMarkdown. When I run the code (with Ctrl+Enter, line by line) within the markdown script, the anova(lmemodel) command does not work as it is supposed to. However, if I copy that single command into a plain R script (still using the current environment), it is executed properly and shows the desired output.
I have no clue what is happening there. If anybody has an idea where the problem is, or how to solve it, I would still be happy to hear it.
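One thing worth trying (an assumption on my part, since the .Rmd file itself isn't shown): knitr's auto-printing inside a chunk does not always behave like the console, so forcing the print method explicitly may restore the output. A self-contained sketch using nlme's bundled Orthodont data as a stand-in model:

```r
library(nlme)

# Small stand-in model; the asker's own gls/lme call would go here
m <- gls(distance ~ age, data = Orthodont)

# Inside an RMarkdown chunk, wrap anova() in print() to force the full table
print(anova(m))
```

If print(anova(...)) shows the table inside the chunk while the bare anova(...) call does not, the culprit is the chunk's auto-printing rather than the model fit itself.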

write.csv returns a blank file in R

I have a data set in .Rdata format, something I haven't worked with before. I would like to export the data to a csv or related file for use in Python. I've used write.csv, write.table, and a few others, and while they all seem to write to the file, when I open it, it's completely blank. I've also tried converting the data to a data frame before exporting, with no luck so far.
After importing the file in R, the data is labeled as a Large array (1499904 elements, 11.5 Mb) with the following attributes:
> attributes(data.station)
$`dim`
[1] 12 31 288 7 2
$dimnames
$dimnames[[1]]
[1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
$dimnames[[2]]
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "20" "21"
[22] "22" "23" "24" "25" "26" "27" "28" "29" "30" "31"
$dimnames[[3]]
[1] "" "00:05:00" "00:10:00" "00:15:00" "00:20:00" "00:25:00" "00:30:00" "00:35:00" "00:40:00"
[10] "00:45:00" "00:50:00" "00:55:00" "01:00:00" "01:05:00" "01:10:00" "01:15:00" "01:20:00" "01:25:00"
[19] "01:30:00" "01:35:00" "01:40:00" "01:45:00" "01:50:00" "01:55:00" "02:00:00" "02:05:00" "02:10:00"
[28] "02:15:00" "02:20:00" "02:25:00" "02:30:00" "02:35:00" "02:40:00" "02:45:00" "02:50:00" "02:55:00"
[37] "03:00:00" "03:05:00" "03:10:00" "03:15:00" "03:20:00" "03:25:00" "03:30:00" "03:35:00" "03:40:00"
[46] "03:45:00" "03:50:00" "03:55:00" "04:00:00" "04:05:00" "04:10:00" "04:15:00" "04:20:00" "04:25:00"
[55] "04:30:00" "04:35:00" "04:40:00" "04:45:00" "04:50:00" "04:55:00" "05:00:00" "05:05:00" "05:10:00"
[64] "05:15:00" "05:20:00" "05:25:00" "05:30:00" "05:35:00" "05:40:00" "05:45:00" "05:50:00" "05:55:00"
[73] "06:00:00" "06:05:00" "06:10:00" "06:15:00" "06:20:00" "06:25:00" "06:30:00" "06:35:00" "06:40:00"
[82] "06:45:00" "06:50:00" "06:55:00" "07:00:00" "07:05:00" "07:10:00" "07:15:00" "07:20:00" "07:25:00"
[91] "07:30:00" "07:35:00" "07:40:00" "07:45:00" "07:50:00" "07:55:00" "08:00:00" "08:05:00" "08:10:00"
[100] "08:15:00" "08:20:00" "08:25:00" "08:30:00" "08:35:00" "08:40:00" "08:45:00" "08:50:00" "08:55:00"
[109] "09:00:00" "09:05:00" "09:10:00" "09:15:00" "09:20:00" "09:25:00" "09:30:00" "09:35:00" "09:40:00"
[118] "09:45:00" "09:50:00" "09:55:00" "10:00:00" "10:05:00" "10:10:00" "10:15:00" "10:20:00" "10:25:00"
[127] "10:30:00" "10:35:00" "10:40:00" "10:45:00" "10:50:00" "10:55:00" "11:00:00" "11:05:00" "11:10:00"
[136] "11:15:00" "11:20:00" "11:25:00" "11:30:00" "11:35:00" "11:40:00" "11:45:00" "11:50:00" "11:55:00"
[145] "12:00:00" "12:05:00" "12:10:00" "12:15:00" "12:20:00" "12:25:00" "12:30:00" "12:35:00" "12:40:00"
[154] "12:45:00" "12:50:00" "12:55:00" "13:00:00" "13:05:00" "13:10:00" "13:15:00" "13:20:00" "13:25:00"
[163] "13:30:00" "13:35:00" "13:40:00" "13:45:00" "13:50:00" "13:55:00" "14:00:00" "14:05:00" "14:10:00"
[172] "14:15:00" "14:20:00" "14:25:00" "14:30:00" "14:35:00" "14:40:00" "14:45:00" "14:50:00" "14:55:00"
[181] "15:00:00" "15:05:00" "15:10:00" "15:15:00" "15:20:00" "15:25:00" "15:30:00" "15:35:00" "15:40:00"
[190] "15:45:00" "15:50:00" "15:55:00" "16:00:00" "16:05:00" "16:10:00" "16:15:00" "16:20:00" "16:25:00"
[199] "16:30:00" "16:35:00" "16:40:00" "16:45:00" "16:50:00" "16:55:00" "17:00:00" "17:05:00" "17:10:00"
[208] "17:15:00" "17:20:00" "17:25:00" "17:30:00" "17:35:00" "17:40:00" "17:45:00" "17:50:00" "17:55:00"
[217] "18:00:00" "18:05:00" "18:10:00" "18:15:00" "18:20:00" "18:25:00" "18:30:00" "18:35:00" "18:40:00"
[226] "18:45:00" "18:50:00" "18:55:00" "19:00:00" "19:05:00" "19:10:00" "19:15:00" "19:20:00" "19:25:00"
[235] "19:30:00" "19:35:00" "19:40:00" "19:45:00" "19:50:00" "19:55:00" "20:00:00" "20:05:00" "20:10:00"
[244] "20:15:00" "20:20:00" "20:25:00" "20:30:00" "20:35:00" "20:40:00" "20:45:00" "20:50:00" "20:55:00"
[253] "21:00:00" "21:05:00" "21:10:00" "21:15:00" "21:20:00" "21:25:00" "21:30:00" "21:35:00" "21:40:00"
[262] "21:45:00" "21:50:00" "21:55:00" "22:00:00" "22:05:00" "22:10:00" "22:15:00" "22:20:00" "22:25:00"
[271] "22:30:00" "22:35:00" "22:40:00" "22:45:00" "22:50:00" "22:55:00" "23:00:00" "23:05:00" "23:10:00"
[280] "23:15:00" "23:20:00" "23:25:00" "23:30:00" "23:35:00" "23:40:00" "23:45:00" "23:50:00" "23:55:00"
$dimnames[[4]]
[1] "tempinf" "tempf" "humidityin" "humidity" "solarradiation" "hourlyrainin"
[7] "windspeedmph"
$dimnames[[5]]
[1] "2020" "2021"
Any advice on how to handle this? Thank you!
You have to flatten the array to write it. First we create a reproducible example of your data:
x <- 1:(2 * 3 * 4 * 5 * 6)
dnames <- list(LETTERS[1:2], LETTERS[3:5], LETTERS[6:9], LETTERS[10:14], LETTERS[15:20])
y <- array(x, dim=c(2, 3, 4, 5, 6), dimnames=dnames)
str(y)
# int [1:2, 1:3, 1:4, 1:5, 1:6] 1 2 3 4 5 6 7 8 9 10 ...
# - attr(*, "dimnames")=List of 5
# ..$ : chr [1:2] "A" "B"
# ..$ : chr [1:3] "C" "D" "E"
# ..$ : chr [1:4] "F" "G" "H" "I"
# ..$ : chr [1:5] "J" "K" "L" "M" ...
# ..$ : chr [1:6] "O" "P" "Q" "R" ...
attributes(y)
# $dim
# [1] 2 3 4 5 6
#
# $dimnames
# $dimnames[[1]]
# [1] "A" "B"
#
# $dimnames[[2]]
# [1] "C" "D" "E"
#
# $dimnames[[3]]
# [1] "F" "G" "H" "I"
#
# $dimnames[[4]]
# [1] "J" "K" "L" "M" "N"
#
# $dimnames[[5]]
# [1] "O" "P" "Q" "R" "S" "T"
Now we flatten the array and write it to a file:
z <- as.data.frame.table(y)
str(z)
# 'data.frame': 720 obs. of 6 variables:
# $ Var1: Factor w/ 2 levels "A","B": 1 2 1 2 1 2 1 2 1 2 ...
# $ Var2: Factor w/ 3 levels "C","D","E": 1 1 2 2 3 3 1 1 2 2 ...
# $ Var3: Factor w/ 4 levels "F","G","H","I": 1 1 1 1 1 1 2 2 2 2 ...
# $ Var4: Factor w/ 5 levels "J","K","L","M",..: 1 1 1 1 1 1 1 1 1 1 ...
# $ Var5: Factor w/ 6 levels "O","P","Q","R",..: 1 1 1 1 1 1 1 1 1 1 ...
# $ Freq: int 1 2 3 4 5 6 7 8 9 10 ...
write.csv(z, file="dfz.csv", row.names=FALSE)
Finally we read the file and convert it back to an array:
a <- read.csv("dfz.csv", as.is=FALSE)
b <- xtabs(Freq~., a)
class(b) <- "array"
attr(b, "call") <- NULL
names(dimnames(b)) <- NULL
str(b)
# int [1:2, 1:3, 1:4, 1:5, 1:6] 1 2 3 4 5 6 7 8 9 10 ...
# - attr(*, "dimnames")=List of 5
# ..$ : chr [1:2] "A" "B"
# ..$ : chr [1:3] "C" "D" "E"
# ..$ : chr [1:4] "F" "G" "H" "I"
# ..$ : chr [1:5] "J" "K" "L" "M" ...
# ..$ : chr [1:6] "O" "P" "Q" "R" ...
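As a sanity check, the whole round trip can be verified end to end on a small array. This sketch repeats the flatten/write/read/rebuild steps above and compares the result to the original (comparing contents and dimnames rather than identical(), since the rebuilt object's class attribute can differ):

```r
# Small array with dimnames, flattened to long format and rebuilt
y <- array(1:24, dim = c(2, 3, 4),
           dimnames = list(c("a", "b"), c("c", "d", "e"), c("f", "g", "h", "i")))
z <- as.data.frame.table(y)        # one row per cell, value in column Freq
f <- tempfile(fileext = ".csv")
write.csv(z, f, row.names = FALSE)

a <- read.csv(f, as.is = FALSE)    # factors needed so xtabs keeps the levels
b <- xtabs(Freq ~ ., a)            # rebuild the array from the long format
names(dimnames(b)) <- NULL

all(b == y)                         # contents agree
# [1] TRUE
identical(dimnames(b), dimnames(y)) # dimnames agree
# [1] TRUE
```

Note that the factor levels come back sorted alphabetically, so this round trip is exact only when the original dimnames are already in sorted order, as they are here.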

Using "LETTERS" to add a row to dataset

I have a dataset that contains 301 columns. I want to add a row that contains a unique letter label for each column, starting with "A" and continuing until the final column. What I have is:
DF:
Var1 Var2.....Var78 Var79...Var130V Var131
What I want in my extra row is:
DF:
Var1 Var2.....Var78 Var79...Var130V Var131
NR A B CA CB EA EB
I have tried
DF[[nrow(DF)+1,]<- rep(letters)
DF[nrow(DF)+1,]<- paste(letters:ncol)
Any help would be appreciated!
You could have the same unique letter headers that you would get in "Excel" columns using this one-liner:
c(LETTERS, c(sapply(LETTERS, paste0, LETTERS)))[1:301]
#> [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M"
#> [14] "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
#> [27] "AA" "AB" "AC" "AD" "AE" "AF" "AG" "AH" "AI" "AJ" "AK" "AL" "AM"
#> [40] "AN" "AO" "AP" "AQ" "AR" "AS" "AT" "AU" "AV" "AW" "AX" "AY" "AZ"
#> [53] "BA" "BB" "BC" "BD" "BE" "BF" "BG" "BH" "BI" "BJ" "BK" "BL" "BM"
#> [66] "BN" "BO" "BP" "BQ" "BR" "BS" "BT" "BU" "BV" "BW" "BX" "BY" "BZ"
#> [79] "CA" "CB" "CC" "CD" "CE" "CF" "CG" "CH" "CI" "CJ" "CK" "CL" "CM"
#> [92] "CN" "CO" "CP" "CQ" "CR" "CS" "CT" "CU" "CV" "CW" "CX" "CY" "CZ"
#> [105] "DA" "DB" "DC" "DD" "DE" "DF" "DG" "DH" "DI" "DJ" "DK" "DL" "DM"
#> [118] "DN" "DO" "DP" "DQ" "DR" "DS" "DT" "DU" "DV" "DW" "DX" "DY" "DZ"
#> [131] "EA" "EB" "EC" "ED" "EE" "EF" "EG" "EH" "EI" "EJ" "EK" "EL" "EM"
#> [144] "EN" "EO" "EP" "EQ" "ER" "ES" "ET" "EU" "EV" "EW" "EX" "EY" "EZ"
#> [157] "FA" "FB" "FC" "FD" "FE" "FF" "FG" "FH" "FI" "FJ" "FK" "FL" "FM"
#> [170] "FN" "FO" "FP" "FQ" "FR" "FS" "FT" "FU" "FV" "FW" "FX" "FY" "FZ"
#> [183] "GA" "GB" "GC" "GD" "GE" "GF" "GG" "GH" "GI" "GJ" "GK" "GL" "GM"
#> [196] "GN" "GO" "GP" "GQ" "GR" "GS" "GT" "GU" "GV" "GW" "GX" "GY" "GZ"
#> [209] "HA" "HB" "HC" "HD" "HE" "HF" "HG" "HH" "HI" "HJ" "HK" "HL" "HM"
#> [222] "HN" "HO" "HP" "HQ" "HR" "HS" "HT" "HU" "HV" "HW" "HX" "HY" "HZ"
#> [235] "IA" "IB" "IC" "ID" "IE" "IF" "IG" "IH" "II" "IJ" "IK" "IL" "IM"
#> [248] "IN" "IO" "IP" "IQ" "IR" "IS" "IT" "IU" "IV" "IW" "IX" "IY" "IZ"
#> [261] "JA" "JB" "JC" "JD" "JE" "JF" "JG" "JH" "JI" "JJ" "JK" "JL" "JM"
#> [274] "JN" "JO" "JP" "JQ" "JR" "JS" "JT" "JU" "JV" "JW" "JX" "JY" "JZ"
#> [287] "KA" "KB" "KC" "KD" "KE" "KF" "KG" "KH" "KI" "KJ" "KK" "KL" "KM"
#> [300] "KN" "KO"
To write this into your first row in DF you could do:
DF <- rbind(c(LETTERS, c(sapply(LETTERS, paste0, LETTERS)))[1:301], DF)
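The one-liner generates all 702 Excel-style labels (A-Z, then AA-ZZ) before truncating to 301. A quick sketch checking a few positions against the boundaries and against the 301st label shown in the output above:

```r
excel_letters <- c(LETTERS, c(sapply(LETTERS, paste0, LETTERS)))

length(excel_letters)          # 26 single letters + 26*26 pairs
# [1] 702
excel_letters[c(26, 27, 301)]  # the Z/AA boundary and the 301st label
# [1] "Z"  "AA" "KO"
```

One caveat on the rbind approach: binding a character row onto numeric columns coerces those columns to character, so if the column types matter, consider storing the labels in names() or an attribute instead of as a data row.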

Transforming Dates in R

I have a large csv file in which the relevant dates are categorical and formatted in one column as follows: "Thu, 21 Jan 2012 04:59:00 -0000". I am trying to use as.Date, but it doesn't seem to be working. It would be great to have several columns for weekday, day, month, and year, but I am happy to settle for one column at this point. Any suggestions?
UPDATE QUESTION: Each row has a different date in the above format (weekday, day, month, year, hour, minutes, seconds). I did not make that clear. How do I transform each date in the column?
The anytime package can parse this without a format:
R> anytime("Thu, 21 Jan 2012 04:59:00 -0000")
[1] "2012-01-21 04:59:00 CST"
R>
It returns a POSIXct you can then operate on, or just format(), at will. It also has a simpler variant anydate() which returns a Date object instead.
library(lubridate)
my_date <- "Thu, 21 Jan 2012 04:59:00 -0000"
# Get it into date format
my_date <- dmy_hms(my_date)
# Use convenience functions to set up the columns you wanted
data.frame(day=day(my_date), month=month(my_date), year=year(my_date),
timestamp = my_date)
day month year timestamp
1 21 1 2012 2012-01-21 04:59:00
We can use
as.Date(str1, "%a, %d %b %Y")
#[1] "2012-01-21"
If we need DateTime format
v1 <- strptime(str1, '%a, %d %b %Y %H:%M:%S %z', tz = "UTC")
v1
#[1] "2012-01-21 04:59:00 UTC"
Or using lubridate
library(lubridate)
dmy_hms(str1)
#[1] "2012-01-21 04:59:00 UTC"
data
str1 <- "Thu, 21 Jan 2012 04:59:00 -0000"
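Since the update asks about transforming every row of a column, note that all of the parsers above are vectorised over character vectors. A base-R sketch with a second, made-up date appended (note that parsing %a/%b names assumes an English locale):

```r
dates_raw <- c("Thu, 21 Jan 2012 04:59:00 -0000",
               "Fri, 03 Feb 2012 10:15:00 -0000")  # second row is made up

parsed <- strptime(dates_raw, "%a, %d %b %Y %H:%M:%S %z", tz = "UTC")

# One column for each of the components the question asked about
data.frame(weekday = weekdays(parsed),
           day     = as.integer(format(parsed, "%d")),
           month   = as.integer(format(parsed, "%m")),
           year    = as.integer(format(parsed, "%Y")))
```

The same pattern applies to anytime() or dmy_hms(): pass the whole column and every row is parsed in one call.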
If you really want the separation in components then start with Dirk's powerful suggestion and then transpose the output of as.POSIXlt:
library(anytime)
times <- c("2004-03-21 12:45:33.123456", # example from ?anytime
"2004/03/21 12:45:33.123456",
"20040321 124533.123456",
"03/21/2004 12:45:33.123456",
"03-21-2004 12:45:33.123456",
"2004-03-21",
"20040321",
"03/21/2004",
"03-21-2004",
"20010101")
t( sapply( anytime::anytime(times),
function(x) unlist( as.POSIXlt(x)) ) )
sec min hour mday mon year wday yday isdst
[1,] "33.1234560012817" "45" "12" "21" "2" "104" "0" "80" "0"
[2,] "33.1234560012817" "45" "12" "21" "2" "104" "0" "80" "0"
[3,] "33.1234560012817" "45" "12" "21" "2" "104" "0" "80" "0"
[4,] "33.1234560012817" "45" "12" "21" "2" "104" "0" "80" "0"
[5,] "33.1234560012817" "45" "12" "21" "2" "104" "0" "80" "0"
[6,] "0" "0" "0" "21" "2" "104" "0" "80" "0"
[7,] "0" "0" "0" "21" "2" "104" "0" "80" "0"
[8,] "0" "0" "0" "21" "2" "104" "0" "80" "0"
[9,] "0" "0" "0" "21" "2" "104" "0" "80" "0"
[10,] "0" "0" "0" "1" "9" "101" "1" "273" "1"
zone gmtoff
[1,] "PST" "-28800"
[2,] "PST" "-28800"
[3,] "PST" "-28800"
[4,] "PST" "-28800"
[5,] "PST" "-28800"
[6,] "PST" "-28800"
[7,] "PST" "-28800"
[8,] "PST" "-28800"
[9,] "PST" "-28800"
[10,] "PDT" "-25200"

R: find first non-NA observation in data.table column by group

I have a data.table with many missing values and I want a variable that gives me a 1 for the first non-missing value in each group.
Say I have such a data.table:
library(data.table)
DT <- data.table(iris)[,.(Petal.Width,Species)]
DT[c(1:10,15,45:50,51:70,101:134),Petal.Width:=NA]
which now has missing values at the beginning, at the end, and in between. I have tried two versions; one is:
DT[min(which(!is.na(Petal.Width))),first_available:=1,by=Species]
but it only finds the global minimum (in this case, setosa gets the correct 1), not the minimum by group. I think this is the case because data.table first subsets by i, then sorts by group, correct? So it will only work with the row that is the global minimum of which(!is.na(Petal.Width)) which is the first non-NA value.
A second attempt with the test in j:
DT[,first_available:= ifelse(min(which(!is.na(Petal.Width))),1,0),by=Species]
which just returns a column of 1s. Here, I don't have a good explanation as to why it doesn't work.
my goal is this:
DT[,first_available:=0]
DT[c(11,71,135),first_available:=1]
but in reality I have hundreds of groups. Any help would be appreciated!
Edit: this question does come close but is not targeted at NA's and does not solve the issue here if I understand it correctly. I tried:
DT <- data.table(DT, key = c('Species'))
DT[unique(DT[,key(DT), with = FALSE]), mult = 'first']
Here's one way:
DT[!is.na(Petal.Width), first := as.integer(seq_len(.N) == 1L), by = Species]
We can try
DT[DT[, .I[which.max(!is.na(Petal.Width))] , Species]$V1,
first_available := 1][is.na(first_available), first_available := 0]
Or a slightly more compact option is
DT[, first_available := as.integer(1:nrow(DT) %in%
DT[, .I[!is.na(Petal.Width)][1L], by = Species]$V1)][]
> DT[!is.na(DT$Petal.Width) & DT$first_available == 1]
# Petal.Width Species first_available
# 1: 0.2 setosa 1
# 2: 1.8 versicolor 1
# 3: 1.4 virginica 1
> rownames(DT)[!is.na(DT$Petal.Width) & DT$first_available == 1]
# [1] "11" "71" "135"
> rownames(DT)[!is.na(DT$Petal.Width) & DT$first_available == 0]
# [1] "12" "13" "14" "16" "17" "18" "19" "20" "21" "22" "23" "24"
# [13] "25" "26" "27" "28" "29" "30" "31" "32" "33" "34" "35" "36"
# [25] "37" "38" "39" "40" "41" "42" "43" "44" "72" "73" "74" "75"
# [37] "76" "77" "78" "79" "80" "81" "82" "83" "84" "85" "86" "87"
# [49] "88" "89" "90" "91" "92" "93" "94" "95" "96" "97" "98" "99"
# [61] "100" "136" "137" "138" "139" "140" "141" "142" "143" "144" "145" "146"
# [73] "147" "148" "149" "150"
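For completeness, the one-line answer can be checked against the target rows (11, 71, 135) stated in the question. A self-contained sketch reusing the question's setup, with the zero default the asker wanted added for the NA rows:

```r
library(data.table)

# The question's setup: blank out leading/trailing/middle runs per species
DT <- data.table(iris)[, .(Petal.Width, Species)]
DT[c(1:10, 15, 45:50, 51:70, 101:134), Petal.Width := NA]

# Default everything to 0, then flag the first non-NA row within each group
DT[, first_available := 0L]
DT[!is.na(Petal.Width), first_available := as.integer(seq_len(.N) == 1L),
   by = Species]

which(DT$first_available == 1L)
# [1]  11  71 135
```

Because the := assignment in i-subset form updates by reference, the NA rows keep their previously assigned 0 and only the first non-NA row per Species ends up flagged.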
