I'm trying to identify lines with a missing date between two dates.
My initial data.table:
What I want:
I want to delete the columns with only "NA" in them (dt_7 and dt_8).
Perhaps you are looking for something like:
df <- data.frame(dt_1 = 1:10, dt_2 = c(1, NA, 2, 3, NA, 6:10), dt_3 = rep(NA, 10))
df[, -which(colSums(is.na(df)) == nrow(df))]
or
library(dplyr)
df %>% select_if(colSums(is.na(.)) != nrow(df))
The first option doesn't work for data.tables. Sorry, but the second one should solve your problem.
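If you need to stay in data.table and drop the columns by reference, here is a small sketch along the same lines (assuming dt is the data.table version of the df above):
library(data.table)
dt <- as.data.table(df)
# names of the columns that are entirely NA
all_na <- names(dt)[colSums(is.na(dt)) == nrow(dt)]
# remove them by reference (only if there are any)
if (length(all_na)) dt[, (all_na) := NULL]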
I have a dataframe df and I wish to create a new column b that is the smaller value of column a and 10 - a. When there is an NA, I wish column b to also return NA in the corresponding rows, so column b should be c(1, 3, 1, NA). I tried the following code, but all rows of b are 1. I would like a solution in the tidyverse.
library(tidyverse)
df <- data.frame(a = c(1, 3, 9, NA))
df2 <- df %>% mutate(b = min(a, 10 - a, na.rm = T))
I guess the issue arises because of applying the min function, which is complicated by the presence of NA, but I cannot figure out how to solve it.
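For the record, a minimal fix: min() collapses both vectors into a single number (1 here, which is then recycled to every row). Its element-wise counterpart pmin() compares the vectors row by row and propagates NA by default:
# pmin() compares a and 10 - a element-wise; NA rows stay NA
df2 <- df %>% mutate(b = pmin(a, 10 - a))
df2$b  # 1 3 1 NA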
filter(data, function(x) sum(is.na(x))>2)
I want to use this piece of code to get a subset of the data that contains only rows with fewer than 2 NA values, but this error occurs:
Error: Argument 2 filter condition does not evaluate to a logical vector
I'm just wondering about the reason, and how I could deal with it? Many thanks!
We can use filter with rowSums:
library(dplyr)
data %>%
  filter(rowSums(is.na(.)) < 2)
Or using base R
data[rowSums(is.na(data)) < 2,]
data
data <- data.frame(col1 = c(2, 3, NA), col2 = c(2, NA, NA), col3 = c(1, 2, 3))
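A side note on the error message: filter() expects each of its arguments to evaluate to a logical vector, not a function, which is why the original call fails. If you prefer to avoid the . pronoun, the same condition can be written with across() (assuming dplyr >= 1.0.0):
# count the NAs across all columns of each row
data %>%
  filter(rowSums(across(everything(), is.na)) < 2)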
The following code is designed to extract the first x observations of each column, where the columns are time series spanning different periods (or, put differently, to erase everything other than the first x values in each column …).
The first values can be numbers followed by NAs, as long as they are at the beginning of the time series.
It is crucial that each value stays linked to its own place in the index (the first column, 'Year').
# data example
df <- data.frame("Year" = 1791:1800,
"F1" = c(NA, NA, NA, 1.2,1.3, NA, NA, NA, NA, NA),
"F2" = c(NA, NA, 2.1, 2.2, 2.3, 2.4, 2.5, NA, NA, NA),
"F3" = c(NA, NA, NA, NA, NA, 0.1,0.2,0.3,0.4,0.5),
"F4" = c(NA, 3.1,3.2,3.3,3.4,3.5,3.6,3.7,3.8,3.9))
# Convert the dataframe to a list by column
long <- setNames(lapply(names(df)[-1], function(x) cbind(df[1], df[x])), names(df)[-1])
# and keep only the first 3 non-NA rows of each column
mylist <- lapply(long, function(x){
  head(na.omit(x), 3)
})
# or in a more concise writing (skipping the 'Year' column itself) ??
mylist2 <- lapply(df[-1], function(x){
  head(na.omit(cbind(df[1], x)), 3)
})
# Now 'mylist' (or 'mylist2') contains several data frames of different lengths,
# which doesn't fit into a single wide dataframe, so let's switch to a long-format dataframe
library(reshape2)
mydata <- do.call(rbind, lapply(mylist, function(x){
  melt(x, id.vars = "Year")
}))
# and switch back to regular spreadsheet format
library(tidyverse)
mydataCOL <- spread(mydata, key = "variable", value = "value")
write.table(mydataCOL, "sheet1.txt")
This gets complicated to apply to a list of data frames (multiple Excel files). Is there an easier way to do such operations on each column of each dataframe in the list? :)
I'm currently trying with 'nested' lapply():
mylist <- lapply(d, function(x){
  lapply(x[-1], function(y){
    head(na.omit(cbind(x[1], y)), 50)
  })
})
but this is not the easiest way, I guess... Thanks!
If you are using the tidyverse anyway, why not go all in with Hadley's stuff?
GetTop <- function(indf){
  indf %>%
    pivot_longer(-Year, names_to = "F") %>%
    na.omit() %>%
    group_by(F) %>%
    top_n(3, wt = -Year) %>%
    pivot_wider(names_from = "F")
}
Now we can call it for one dataframe:
mytops <- GetTop(df)
If you have a list of these dataframes you can use lapply to do this to each one.
allmytop <- lapply(biglist, FUN = GetTop)
That will give you a list of dataframes. Seems like you also want to join them into one fat dataframe.
fatdf <- lapply(biglist, FUN = GetTop) %>% reduce(full_join, by = "Year")
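In passing: top_n() is superseded in current dplyr. Assuming dplyr >= 1.0.0, the same "first 3 years per series" selection reads a little more naturally with slice_min(); a sketch of a hypothetical variant, not a behavioural change:
GetTop2 <- function(indf){
  indf %>%
    pivot_longer(-Year, names_to = "F") %>%
    na.omit() %>%
    group_by(F) %>%
    slice_min(Year, n = 3) %>%  # the 3 earliest observed years per series
    pivot_wider(names_from = "F")
}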
I have a data set with several columns and I'm working on it using R. Most of those columns have missing data, coded as the value -200. What I want to do is delete all the rows that have -200 in any of the columns. Is there an easy way to do this other than going one column at a time? Can I delete all rows that have a value of -200 all at once?
Thank you for your time!
A tidyverse option would be
library(tidyverse)
df %>%
  filter_all(all_vars(. != -200))
data
df <- data.frame(v1 = c(-200, 1, 2, 3), v2 = c(1, -200, 2, 4))
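Note that filter_all() is superseded in current dplyr; assuming dplyr >= 1.0.0, the same filter can be written with if_all():
# keep rows where every column differs from -200
df %>%
  filter(if_all(everything(), ~ .x != -200))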
You can use rowSums(), i.e.
df[rowSums(df == -200) == 0,]
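One caveat with the rowSums() approach: if the data also contains genuine NA values, df == -200 yields NA there and the whole row comparison becomes NA. Adding na.rm = TRUE keeps such rows:
# treat NA cells as "not -200" when counting matches
df[rowSums(df == -200, na.rm = TRUE) == 0, ]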
I have a string variable in a dataframe and want to delete the rows that contain strings like "A" or "B". I used the following code, but it didn't work:
isna=apply(DATA[1], 2, function(x)x!="A"|"B")
isna=apply(DATA[1], 2, function(x)x!="A"||"B")
Is there a reason you need to use apply?
DATA <- data.frame(code=sample(LETTERS[1:5],10, replace = TRUE))
subset(DATA, code!="A" & code!="B")
If I understood what you need correctly, then this is also an option:
library(dplyr)
# an exemplary dataframe
df <- data.frame(col1 = sample(LETTERS[1:5], 20, replace = TRUE),
                 col2 = 1:20)
df
# the filter for choosing the rows
filter(df, !col1 %in% c("A", "B"))
Or, fixing the logic of your own apply() call (each comparison needs its own x, combined with the vectorised &):
isna <- apply(DATA[1], 2, function(x) (x != "A") & (x != "B"))
DATA <- DATA[isna, ]
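And if the values merely contain "A" or "B" rather than being exactly equal to them, a pattern match may be closer to what you described (using the code column from the example above):
# keep rows whose code does not contain "A" or "B" anywhere
DATA[!grepl("A|B", DATA$code), , drop = FALSE]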