I have the following data frame
school<-c("NYU", "BYU", "USC")
state<-c("NY","UT","CA")
measure<-c("MSAT","MSAT","GPA")
score<-c(500, 490, 2.9)
score2<-(c(200, 280, 4.3))
df<-data.frame(school,state, measure,score,score2, stringsAsFactors=FALSE)
> df
school state measure score score2
1 NYU NY MSAT 500.0 200.0
2 BYU UT MSAT 490.0 280.0
3 USC CA GPA 2.9 4.3
And I would like to set all the values for certain columns to NA without any condition. Just set them to NA. i.e.
> df
school state measure score score2
1 NYU NA MSAT NA NA
2 BYU NA MSAT NA NA
3 USC NA GPA NA NA
I have tried:
df <- mutate_at(vars(-school,-measure),na_if(.,!is.na(.)))
Where I was expecting na_if(.,!is.na(.)) to convert any value that wasn't already NA to NA. But as you can see I'm not correctly feeding the columns into the is.na() function.
Error in length_x %in% c(1L, n) : object '.' not found
How would I go about achieving this? I have many more columns I would like to perform this on than columns I want to preserve.
This would do it
df[,c("state","score","score2")]<-NA
Since you're asking specifically for mutate, here are some things to consider:
In your line df <- mutate_at(vars(-school,-measure),na_if(.,!is.na(.))), it fails because it expects df as the first argument - or piped in. The correct usage would be
df <- df %>% mutate_at(vars(-school,-measure),na_if(.,!is.na(.)))
But that doesn't solve it, because
Have you checked what na_if and is.na do? Just a quick ?na_if? It doesn't replace values where the second argument is TRUE; it replaces values that are equal to the second argument with NA. So that just plainly doesn't work as expected.
And finally,
Why only change non-NA values to NA? Why not just change everything to NA?
Which leads to the following solution:
just.na <- function(x) rep(NA, length(x))
df %>% mutate_at(vars(-school, -measure), just.na)
Or, an anonymous function:
df %>% mutate_at(vars(-school, -measure), ~rep(NA, length(.)))
Or, it turns out you can do this:
df %>% mutate_at(vars(-school, -measure), ~NA)
(I am equally surprised!)
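As a small illustration of the points above, here is a hedged sketch rather than a definitive recipe. It first shows what na_if() does (only values equal to the second argument become NA), then blanks the columns with across(). The across() call assumes dplyr 1.0.0 or later; the names x and out are made up for this example.

```r
library(dplyr)

# na_if() turns values EQUAL to its second argument into NA, nothing else.
x <- na_if(c(500, 490, 2.9), 490)

df <- data.frame(school  = c("NYU", "BYU", "USC"),
                 state   = c("NY", "UT", "CA"),
                 measure = c("MSAT", "MSAT", "GPA"),
                 score   = c(500, 490, 2.9),
                 score2  = c(200, 280, 4.3),
                 stringsAsFactors = FALSE)

# Assumption: dplyr >= 1.0.0, where across() supersedes mutate_at().
# Every column except school and measure is replaced by NA.
out <- df %>% mutate(across(-c(school, measure), ~ NA))
```

Note that the blanked columns become logical NA columns; if the original types matter, something like NA_character_ or NA_real_ would have to be assigned per column.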
Try this:
df[,c(2,4,5)] <- NA
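Since the question mentions that there are many more columns to blank than to preserve, a name-based variant may be easier to maintain than positions. This is only a sketch in base R; keep and to_blank are hypothetical names introduced here.

```r
df <- data.frame(school  = c("NYU", "BYU", "USC"),
                 state   = c("NY", "UT", "CA"),
                 measure = c("MSAT", "MSAT", "GPA"),
                 score   = c(500, 490, 2.9),
                 score2  = c(200, 280, 4.3),
                 stringsAsFactors = FALSE)

# List only the columns to preserve; everything else gets blanked.
keep     <- c("school", "measure")
to_blank <- setdiff(names(df), keep)

df[to_blank] <- NA
```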
I am trying to get a row 'most_lost' out of the Titanic df.
I created a new variable (most_lost) and want it to hold the row of the Titanic df with the highest Freq.
I have tried multiple approaches and every time I run it the information comes back NA.
With this line of code
most_lost <- unlist(titanic_df[max("Freq"), ])
my data come out
Class Sex Age Survived Freq
NA NA NA NA NA
I want it to come out
Crew Male Adult No 670
I have tried
most_lost <- titanic_df[max("Freq"), ]
and it still returns NA
If you created a variable, then it is going to be a column not a row.
Try
max(titanic_df[, "Freq"])
Edit
OK, in case you want to get the row...
titanic_df[titanic_df$Freq == max(titanic_df$Freq), ]
Or using tidyverse:
library(tidyverse)
titanic_df %>%
filter(Freq == max(Freq))
or you can order by freq and take the first row,
titanic_df %>%
arrange(desc(Freq)) %>%
head(1)
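To see why the original attempt returns NA, the sketch below assumes titanic_df was built from R's built-in Titanic table (the question does not say how it was created). max("Freq") is applied to the string "Freq", not to the column, so the data frame is asked for a row named "Freq". which.max() is one base R way to get the row position instead.

```r
# Assumption: titanic_df comes from the built-in Titanic contingency table.
titanic_df <- as.data.frame(Titanic)

# max() of a single character string just returns that string ...
idx <- max("Freq")
# ... so this looks up a row NAMED "Freq", which does not exist: all NA.
no_such_row <- titanic_df[idx, ]

# which.max() gives the position of the largest Freq value.
most_lost <- titanic_df[which.max(titanic_df$Freq), ]
```

Unlike the filter-based answers, which.max() returns only the first maximum, so ties would be dropped.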
I have a column variable that I want to split into three factor variables. There are the factor variables I want to create:
goal<-c('newref', 'meow', 'woof')
area<-c('eco', 'social', 'bank')
fr<-c('demo', 'hist', 'util')
And the current variable looks more or less like that:
code<-c('goal\\\\meow', 'area\\\\bank', 'area\\\\bank', 'fr\\\\utilitarian', 'fr\\\\history')
And let's say the dataframe is something like that
df<-data.frame(var1=c(1,2,3,4,5), var2=c('a', 'b', 'c', 'd', 'e'), code=code)
So I would like to create 3 new columns, one per each factor variable, and use a regular expression that detected what it belongs to. So for example row number one should look as follows:
row1<-data.frame(var1=1, var2=c('a'), code=c('goal\\\\meow'), goal=2, area=NA, fr=NA)
Also note that the value of the factor variables is an abbreviation of the value in code (eg, history / hist).
The database is likely to have 10000 entries, so I would really appreciate any hints on this.
Thank you!
We can define a function that finds the position of the factor variable that, when used as a regular expression, finds a match in the code column:
find_match <- function(code, matches) {
apply(sapply(matches, grepl, code), 1, match, x = TRUE)
}
If there is no match, this function returns NA for that row.
Next, we can simply use mutate from dplyr to add each column of factors:
df %>% mutate(goal = find_match(code, goal),
area = find_match(code, area),
fr = find_match(code, fr))
Which gives:
var1 var2 code goal area fr
1 1 a goal\\\\meow 2 NA NA
2 2 b area\\\\bank NA 3 NA
3 3 c area\\\\bank NA 3 NA
4 4 d fr\\\\utilitarian NA NA 3
5 5 e fr\\\\history NA NA 2
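To make the one-liner inside find_match easier to follow, here is the same idea split into two steps for the fr patterns. It is a sketch using the data from the question; hits and fr_index are names introduced only for this illustration.

```r
fr   <- c("demo", "hist", "util")
code <- c("goal\\\\meow", "area\\\\bank", "area\\\\bank",
          "fr\\\\utilitarian", "fr\\\\history")

# Step 1: one logical column per pattern, one row per element of code.
hits <- sapply(fr, grepl, code)

# Step 2: for each row, the position of the first TRUE
# (NA when none of the patterns matches that row).
fr_index <- apply(hits, 1, match, x = TRUE)
```

Because the patterns are treated as regular expressions, "hist" matches "history" and "util" matches "utilitarian", which is what handles the abbreviations.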
Doing this with tidyverse tools like the pipe %>%, dplyr and tidyr:
separate() breaks up the code column into two at the separator you specify.
Because "\" is a special character in regex, you have to escape each \ you want to look for with another \.
spread() converts it from tall form to wide form as you needed.
library(dplyr)
library(tidyr)  # separate() and spread() come from tidyr
df %>%
separate(code, into = c("colName", "value"), sep = "\\\\\\\\") %>%
spread(colName, value)
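The double layer of escaping in the separator can be checked in base R without any packages. The sketch below uses strsplit() on two of the sample values; it is only meant to illustrate the escaping, not to replace the separate() call.

```r
code <- c("goal\\\\meow", "area\\\\bank")

# Regex version: each literal backslash is escaped once for the regex
# and once more for the R string, hence the long separator.
parts_regex <- strsplit(code, "\\\\\\\\")

# fixed = TRUE treats the separator as plain text, so only the
# R-string level of escaping is needed.
parts_fixed <- strsplit(code, "\\\\", fixed = TRUE)
```

Both calls split each value into the prefix (goal, area) and the remainder (meow, bank).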
I am trying to bring multiple things together using dplyr: given a time series of multiple returns, I want to calculate the average correlation of each return series with all of the other returns (I simplified my real task to give the easiest possible example). Of course, in contrast to the example below, my real dataset is rather large, is not yet in wide form (spread(stock, ret)), and contains multiple NAs. Also, in a second step I would have to create my own function and supply that to rollapply. Therefore, if you have a suggestion using something from the RcppRoll package I would be more than happy!
In the below example you can see that I need to input all columns at once, select a window, apply a function to all columns simultaneously, receive a vector with the same number of columns and so on...
Here is my example:
df <- data.frame(Date =as.Date("1926-01-01")+1:24,
PERMNO1 = rnorm(24,0.01,0.3),
PERMNO2 = rnorm(24,0.02,0.4),
PERMNO2 = rnorm(24,-0.01,0.6))
df %>%
do(rollapplyr(.[,-1],width=12,function(a) colMeans(cor(a))))
What I would like to get is something like this:
df2 <- df; df2[,2:4]<-NA
for (i in 12:24){
df2[i,2:4] <- colMeans(cor(df[(i-11):i,2:4]))
}
df2
Date PERMNO1 PERMNO2 PERMNO2.1
1926-01-02 NA NA NA
1926-01-03 NA NA NA
1926-01-04 NA NA NA
1926-01-05 NA NA NA
1926-01-06 NA NA NA
1926-01-07 NA NA NA
1926-01-08 NA NA NA
1926-01-09 NA NA NA
1926-01-10 NA NA NA
1926-01-11 NA NA NA
1926-01-12 NA NA NA
1926-01-13 0.14701350 0.2001694 0.3787320
1926-01-14 0.15364347 0.2438042 0.3143516
1926-01-15 0.16118233 0.2549841 0.3266877
1926-01-16 0.04727533 0.2534126 0.3132990
1926-01-17 0.05220443 0.2411095 0.2744379
1926-01-18 0.12252848 0.2461743 0.2766122
1926-01-19 0.08414717 0.2287705 0.2897744
1926-01-20 0.11164866 0.2503174 0.2414130
1926-01-21 0.08886537 0.2604810 0.2621597
1926-01-22 0.14216304 0.2667540 0.2543573
1926-01-23 0.12654902 0.3086711 0.2751671
1926-01-24 0.11068607 0.3019835 0.2728166
1926-01-25 0.06714698 0.2696828 0.2184242
Convert the data frame to a zoo object, run rollapplyr and convert back:
library(dplyr)
library(zoo)
df %>%
read.zoo %>%
rollapplyr(12, function(x) colMeans(cor(x)), by.column = FALSE, fill = NA) %>%
fortify.zoo
The last line could be omitted if you want to just keep the answer as a zoo object which would probably be more convenient than representing a time series as a data frame.
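A self-contained version of this approach, written without the pipe, might look like the sketch below. The seed, the column name PERMNO3 and the helper avg_cor are assumptions made for the example; avg_cor stands in for the custom function mentioned in the question.

```r
library(zoo)

set.seed(1)  # arbitrary seed so the sketch is reproducible
df <- data.frame(Date    = as.Date("1926-01-01") + 1:24,
                 PERMNO1 = rnorm(24, 0.01, 0.3),
                 PERMNO2 = rnorm(24, 0.02, 0.4),
                 PERMNO3 = rnorm(24, -0.01, 0.6))

z <- read.zoo(df)  # the first column (Date) becomes the index

# Hypothetical user-defined function: average correlation per column.
avg_cor <- function(x) colMeans(cor(x))

# by.column = FALSE passes the whole 12-row window (all columns) at once;
# fill = NA pads the first 11 rows, where no full window exists.
roll <- rollapplyr(z, 12, avg_cor, by.column = FALSE, fill = NA)

out <- fortify.zoo(roll)  # back to a data frame with an Index column
```

If the real data contain NAs, cor() accepts a use argument such as use = "pairwise.complete.obs"; whether that is appropriate depends on the analysis.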
I am trying to filter NA, NaN and Inf values out of a tbl using dplyr's filter function.
The trick is that I only want to apply the filter to columns whose names contain a specific pattern. The pattern is: r1, r2, r3, etc.
I have tried to combine grep and filter to achieve this, but can't get it to work. My current code looks like this:
filter_(!is.na(grep("r[1-9]", colnames(DF), value = TRUE))
& !is.infinite(grep("r[1-9]", colnames(DF), value = TRUE))
& !is.nan(grep("r[1-9]", colnames(DF), value = TRUE)))
However, this code returns a warning message: "Truncating vector to length 1."
And the data returned is unfiltered.
I suspect that it's the is.na functions here that are causing the problem, because I've seen an example online where you can apply grep to filter using a normal condition (i.e. condition == value) and not a condition based on is.na.
dplyr provides matches() that is useful for this
Example 1: How does matches() work?
library(dplyr)
# remove columns whose names contain "mp"
mtcars %>% select(-matches("mp"))
# keep columns whose names contain "mp"
mtcars %>% select(matches("mp"))
Example 2: Using matches() in the context of your request but using a MWE
# Create a dummy dataset
data = tibble(id = c("John","Paul","George","Ringo"),
r1 = c(1,2,NA,NA),
r2 = c(1,2,NA,4),
s1 = c(1,NA,3,4))
# Keep rows with no NAs in columns whose names contain r followed by a digit
data %>% filter_at(vars(matches("r[0-9]")), all_vars(!is.na(.)))
Here is a base R method to filter rows, comparing specific columns.
# sample data
set.seed(1234)
dat <- data.frame(r1=c(NA, 1,NaN, 5, Inf), r2=c(NA, 1,NaN, NA, Inf), d=rnorm(5))
This data set looks like
dat
r1 r2 d
1 NA NA -1.2070657
2 1 1 0.2774292
3 NaN NaN 1.0844412
4 5 NA -2.3456977
5 Inf Inf 0.4291247
We will check the first two columns and ignore the third column. Notice that the only row that should remain is row 2.
dat[Reduce("&", lapply(dat[grep("^r", names(dat))], is.finite)),]
r1 r2 d
2 1 1 0.2774292
Here, a data.frame that is subset using grep to select the appropriate columns (1 and 2) is fed to lapply. The regex "^r" says to include only variables whose names start with "r". In the lapply loop, each vector is checked using is.finite. This function returns FALSE for NA, NaN, and Inf. The resulting list of logical vectors is fed to Reduce, which returns a logical vector with one element per row of the data.frame; an element is TRUE if and only if every selected value in that row is finite.
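The same steps can be written out one at a time, which makes the intermediate objects visible. This is a sketch with a simplified third column (d = 1:5 instead of random numbers); r_cols, checks and keep are names introduced here.

```r
dat <- data.frame(r1 = c(NA, 1, NaN, 5, Inf),
                  r2 = c(NA, 1, NaN, NA, Inf),
                  d  = 1:5)

# Positions of the columns whose names start with "r".
r_cols <- grep("^r", names(dat))

# One logical vector per selected column: TRUE where the value is finite.
checks <- lapply(dat[r_cols], is.finite)

# Combine the vectors element-wise with "&": TRUE only if all are TRUE.
keep <- Reduce("&", checks)

result <- dat[keep, ]
```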
With dplyr, you can use the filter_at function:
dat %>% filter_at(vars(matches("^r[1-9]")), all_vars(is.finite(.)))
Using @lmo's sample data, the result is:
r1 r2 d
1 1 1 0.2774292
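In more recent dplyr releases the scoped verbs such as filter_at() are superseded. A roughly equivalent sketch, assuming dplyr 1.0.4 or later (where if_all() is available), is shown below; the third column is simplified to d = 1:5 for the example.

```r
library(dplyr)

dat <- data.frame(r1 = c(NA, 1, NaN, 5, Inf),
                  r2 = c(NA, 1, NaN, NA, Inf),
                  d  = 1:5)

# Assumption: dplyr >= 1.0.4 for if_all().
# Keep rows where every column matching the pattern is finite,
# which excludes NA, NaN, Inf and -Inf in one test.
out <- dat %>% filter(if_all(matches("^r[1-9]"), is.finite))
```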
I stumbled across a peculiar behavior in the lubridate package: dmy(NA) throws an error instead of just returning an NA. This causes me problems when I want to convert a column in which some elements are NAs and others are date strings that are normally converted without problems.
Here is the minimal example:
library(lubridate)
df <- data.frame(ID=letters[1:5],
Datum=c("01.01.1990", NA, "11.01.1990", NA, "01.02.1990"))
df_copy <- df
#Question 1: Why does dmy(NA) not return NA, but throws an error?
df$Datum <- dmy(df$Datum)
Error in function (..., sep = " ", collapse = NULL) : invalid separator
df <- df_copy
#Question 2: What's a work around?
#1. Idea: Only convert those elements that are not NAs
#RHS works, but assigning that to the LHS doesn't work (most likely problem:
#column "Datum" is still of class factor, while the RHS is of class POSIXct)
df[!is.na(df$Datum), "Datum"] <- dmy(df[!is.na(df$Datum), "Datum"])
Using date format %d.%m.%Y.
Warning message:
In `[<-.factor`(`*tmp*`, iseq, value = c(NA_integer_, NA_integer_, :
invalid factor level, NAs generated
df #Only NAs, apparently problem with class of column "Datum"
ID Datum
1 a <NA>
2 b <NA>
3 c <NA>
4 d <NA>
5 e <NA>
df <- df_copy
#2. Idea: Use mapply and apply dmy only to those elements that are not NA
df[, "Datum"] <- mapply(function(x) {if (is.na(x)) {
return(NA)
} else {
return(dmy(x))
}}, df$Datum)
df #Meaningless numbers returned instead of date-objects
ID Datum
1 a 631152000
2 b NA
3 c 632016000
4 d NA
5 e 633830400
To summarize, I have two questions: 1) Why does dmy(NA) not work? Based on most other functions, I would assume it is good programming practice that every transformation (such as dmy()) of NA returns NA again (just as 2 + NA does). 2) If this behavior is intended, how do I convert a data.frame column that includes NAs via the dmy() function?
The Error in function (..., sep = " ", collapse = NULL) : invalid separator is being caused by the lubridate:::guess_format() function. The NA is being passed as sep in a call to paste(), specifically at fmts <- unlist(mlply(with_seps, paste)). You can have a go at improving the lubridate:::guess_format() to fix this.
Otherwise, could you just change the NA to characters ("NA")?
require(lubridate)
df <- data.frame(ID=letters[1:5],
Datum=c("01.01.1990", "NA", "11.01.1990", "NA", "01.02.1990")) #NAs are quoted
df_copy <- df
df$Datum <- dmy(df$Datum)
Since your dates are in a reasonably straight-forward format, it might be much simpler to just use as.Date and specify the appropriate format argument:
df$Date <- as.Date(df$Datum, format="%d.%m.%Y")
df
ID Datum Date
1 a 01.01.1990 1990-01-01
2 b <NA> <NA>
3 c 11.01.1990 1990-01-11
4 d <NA> <NA>
5 e 01.02.1990 1990-02-01
To see a list of the formatting codes used by as.Date, see ?strptime.
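A compact base R sketch of this approach follows. It also shows that the "meaningless numbers" from the mapply attempt in the question are seconds since 1970-01-01 that lost their date-time class, so they can be converted back. The names datum, dates, secs and recovered are introduced only for this example, and treating the seconds as UTC is an assumption.

```r
datum <- c("01.01.1990", NA, "11.01.1990", NA, "01.02.1990")

# as.character() guards against the column being a factor;
# NA entries simply stay NA.
dates <- as.Date(as.character(datum), format = "%d.%m.%Y")

# Values returned by the mapply attempt in the question.
secs <- c(631152000, NA, 632016000, NA, 633830400)

# Assumption: the seconds are counted from 1970-01-01 in UTC.
recovered <- as.Date(as.POSIXct(secs, origin = "1970-01-01", tz = "UTC"))
```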