grepl across multiple columns in R

I have the following data, which contains "n.a." strings that R does not recognise as missing values.
I am trying to remove the rows containing these values using grepl:
x <- x[!grepl("n.a.", x$Fixed.assets.EUR.Last.avail..yr),]
but I am trying to apply it across all columns instead of specifying each column name and having many lines of text.
What I currently have is
x <- sapply(x[, c(1:4)], !grepl("n.a."))
which produces errors and does not work.
Error in match.fun(FUN) :
'!grepl("n.a.", x[, 1:4])' is not a function, character or symbol
Data
dput(x)[1:6, ]
Fixed.assets.EUR.Last.avail..yr Fixed.assets.EUR.Year...1 Fixed.assets.EUR.Year...2
1 34,827,809 38,549,311 29,035,369
2 755,256 658,200 573,888
3 2,639,824 2,739,205 3,230,890
4 2,543,367 2,317,132 2,994,769
5 1,608,004 1,702,838 1,763,244
6 661,875 661,082 584,166
Fixed.assets.EUR.Year...3
1 30,416,099
2 n.a.
3 2,841,046
4 693,370
5 2,024,666
6 565,007

Let me start by saying that the best practice here would be to specify a na.strings = c("n.a.") argument when you read in your data. That said, this is a way to use grepl() to remove any row where you have n.a. as a string.
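x[!apply(x[, 1:4], 1, function(y) any(grepl("n.a.", y, fixed = TRUE))), ]
# logical negation is used instead of -which(), which would return zero rows when no row contains "n.a."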
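The na.strings suggestion above could look roughly like this. This is a minimal sketch: the semicolon-separated text and the column names assets_last and assets_y3 are invented stand-ins for the real file, which is not shown in the question.

```r
# Hypothetical semicolon-separated input standing in for the real file.
raw <- "assets_last;assets_y3
34,827,809;30,416,099
755,256;n.a.
2,639,824;2,841,046"

# na.strings turns every "n.a." field into a real NA while reading.
x <- read.table(text = raw, header = TRUE, sep = ";",
                na.strings = "n.a.", stringsAsFactors = FALSE)

# Rows with any missing value can then be dropped without grepl().
x_complete <- x[complete.cases(x), ]
print(x_complete)
```

With a real file, the same na.strings argument would be passed to read.csv() or read.table() together with the file path.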

If you want R to recognize "n.a." as NA values without removing the entire row (and hence losing real values across a row with an n.a. value in only one column), you can use this:
df[df=="n.a."] <- NA
Otherwise, you are better off using @Mako212's solution.
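As a rough illustration of why keeping the rows can be useful, the sketch below replaces the "n.a." cells, converts the columns to numeric and then computes per-column means. The two-column data frame and its names yr0 and yr3 are made up for the example.

```r
# Small made-up data frame with one "n.a." cell.
x <- data.frame(yr0 = c("755,256", "2,639,824"),
                yr3 = c("n.a.", "2,841,046"),
                stringsAsFactors = FALSE)

# Cell-wise replacement: no rows are lost.
x[x == "n.a."] <- NA

# Remove thousands separators and convert every column to numeric.
x[] <- lapply(x, function(col) as.numeric(gsub(",", "", col, fixed = TRUE)))

# Missing values can now be skipped per column instead of per row.
col_means <- colMeans(x, na.rm = TRUE)
print(col_means)
```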

Here are 2 alternative options
Example Data
set.seed(1)
df <- as.data.frame(matrix(sample(c("n.a.", "good"), 20, replace=TRUE), ncol=2, byrow=TRUE))
head(df)
# V1 V2
# 1 n.a. n.a.
# 2 good good
# 3 n.a. good
# 4 good good
# 5 good n.a.
# 6 n.a. n.a.
Convert n.a. to NA, then use complete.cases
data <- replace(df, df == "n.a.", NA)
data[complete.cases(data),]
# V1 V2
# 2 good good
# 4 good good
# 9 good good
Use rowSums
df[rowSums(df == "n.a.") == 0,]
# V1 V2
# 2 good good
# 4 good good
# 9 good good
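If grepl() itself is wanted across all columns, as in the original attempt, one possible sketch is below. sapply() failed in the question because its second argument must be a function, not the result of a call. Here grepl is passed as a function to lapply() and the per-column results are combined with Reduce(). The three-row data frame is invented for the example.

```r
# Made-up data: rows 1 and 3 contain the string "n.a.".
df <- data.frame(V1 = c("n.a.", "good", "good"),
                 V2 = c("good", "good", "n.a."),
                 stringsAsFactors = FALSE)

# One logical vector per column; fixed = TRUE treats "." literally.
has_na_string <- lapply(df, grepl, pattern = "n.a.", fixed = TRUE)

# A row is dropped if any of its columns matched.
drop_row <- Reduce(`|`, has_na_string)

clean <- df[!drop_row, ]
print(clean)
```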

Related

Combine rows of data frame in R using colMeans?

I'm impressed by the number of "how to combine rows/columns" threads, but even more by the fact that none of these was particularly helpful or at least not applicable to my issue.
My data look like this:
MyData<-data.frame("id" = c("a","a","b"),
"value1_1990" = c(5,NA,1),
"value2_1990" = c(5,NA,2),
"value1_2000" = c(2,1,1),
"value2_2000" = c(2,1,2),
"value1_2010" = c(NA,9,1),
"value2_2010" = c(NA,9,2))
What I want to do is to combine the two rows where id=="a" for columns MyData[,(2:7)] using base R's colMeans.
What it looks like:
id value1_1990 value2_1990 value1_2000 value2_2000 value1_2010 value2_2010
1 a 5 5 2 2 NA NA
2 a NA NA 1 1 9 9
3 b 1 2 1 2 1 2
What I need:
id value1_1990 value2_1990 value1_2000 value2_2000 value1_2010 value2_2010
1 a 5 5 1.5 1.5 9 9
2 b 1 2 1 2 1 2
What I tried (among numerous other things):
MyData[nrow(MyData)+1, 2:7] = colMeans(MyData[which(MyData$id=="a"),(2:7)],na.rm=T) # to combine values from rows where id=="a"
MyData$id<-ifelse(is.na(MyData$id),"NewRow",MyData$id) # to replace "<NA>" in the id-column of the newly created row by "NewRow".
This works, except for the fact that...
...it turns all other existing id's into numeric values (and I don't want to let the second line of code -- the ifelse-statement -- touch any of the existing id's, which is why I wrote else==MyData$id).
...this is not particularly fancy code. Is there a one-line-of-code solution that does the trick? I saw other approaches using aggregate() but this didn't work for me.
You can try using dplyr:
library(dplyr)
Possible solution:
MyData %>% group_by(id) %>% summarise(across(everything(), ~ mean(.x, na.rm = TRUE))) # dplyr >= 1.0.0; funs() is deprecated
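Since the question mentions that aggregate() did not work, here is a base R sketch of how it might be called. The likely stumbling block is that the formula method drops rows containing NA by default, so na.action = na.pass is set and na.rm = TRUE is forwarded to mean(). The data frame is a shortened version of the one in the question.

```r
# Shortened version of the question's data (three value columns only).
MyData <- data.frame(id = c("a", "a", "b"),
                     value1_1990 = c(5, NA, 1),
                     value1_2000 = c(2, 1, 1),
                     value1_2010 = c(NA, 9, 1))

# na.pass keeps rows with NA; na.rm = TRUE is passed on to mean().
combined <- aggregate(. ~ id, data = MyData, FUN = mean,
                      na.rm = TRUE, na.action = na.pass)
print(combined)
```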

Transform longitudinal table to wide format efficiently in data.table

I am working in R with a long table stored as a data.table. It holds values recorded whenever a variable changes, and the variables are of numeric and character type. When I want to run correlations, regressions, etc., I have to convert the table to wide format and homogenise the timestamp frequency.
I found a way to convert the long table to wide, but I think it is not really efficient, and I would like to know if there is a better, more data.table-native approach.
In the reproducible example below, I include the two options I found to perform the long-to-wide transformation, and in the comments I indicate which parts I believe are not optimal.
library(zoo)
library(data.table)
dt<-data.table(time=1:6,variable=factor(letters[1:6]),numeric=c(1:3,rep(NA,3)),
character=c(rep(NA,3),letters[1:3]),key="time")
print(dt)
print(dt[,lapply(.SD,typeof)])
#option 1
casted<-dcast(dt,time~variable,value.var=c("numeric","character"))
# types are correct, but I got NA filled columns,
# is there an option like drop
# available for columns instead of rows?
print(casted)
print(casted[,lapply(.SD,typeof)])
# This drop looks ugly but I did not figure out a better way to perform it
casted[,names(casted)[unlist(casted[,lapply(lapply(.SD,is.na),all)])]:=NULL]
# I perform a LOCF, I do not know if I could benefit of
# data.table's roll option somehow and avoid
# the temporal memory copy of my dataset (this would be the second
# and minor issue)
casted<-na.locf(casted)
#option2
# taken from http://stackoverflow.com/questions/19253820/how-to-implement-coalesce-efficiently-in-r
coalesce2 <- function(...) {
Reduce(function(x, y) {
i <- which(is.na(x))
x[i] <- y[i]
x},
list(...))
}
casted2<-dcast(dt[,coalesce2(numeric,character),by=c("time","variable")],
time~variable,value.var="V1")
# There are not NA columns but types are incorrect
# it takes more space in a real table (more observations, less variables)
print(casted2)
print(casted2[,lapply(.SD,typeof)])
# Again, I am pretty sure there is a prettier way to do this
numericvars<-names(casted2)[!unlist(casted2[,lapply(
lapply(lapply(.SD,as.numeric),is.na),all)])]
casted2[,eval(numericvars):=lapply(.SD,as.numeric),.SDcols=numericvars]
# same as option 1, is there a data.table native way to do it?
casted2<-na.locf(casted2)
Any advice/improvement in the process is welcome.
I'd maybe do the char and num tables separately and then rbind:
k = "time"
typecols = c("numeric", "character")
res = rbindlist(fill = TRUE,
lapply(typecols, function(tc){
cols = c(k, tc, "variable")
dt[!is.na(get(tc)), ..cols][, dcast(.SD, ... ~ variable, value.var=tc)]
})
)
setorderv(res, k)
res[, setdiff(names(res), k) := lapply(.SD, zoo::na.locf, na.rm = FALSE), .SDcols=!k]
which gives
time a b c d e f
1: 1 1 NA NA NA NA NA
2: 2 1 2 NA NA NA NA
3: 3 1 2 3 NA NA NA
4: 4 1 2 3 a NA NA
5: 5 1 2 3 a b NA
6: 6 1 2 3 a b c
Note that the OP's final result, casted2, differs in that all of its columns are character.
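If the zoo dependency is unwanted, the last-observation-carried-forward step could be sketched with a small base R helper applied by reference. The helper name locf_vec and the four-row table are assumptions made for this illustration, not part of the answer above.

```r
library(data.table)

# Hypothetical helper: carry the last non-NA value forward.
locf_vec <- function(v) {
  # Index of the most recent non-NA element (0 while none has been seen).
  idx <- cummax(seq_along(v) * !is.na(v))
  idx[idx == 0] <- NA  # leading NAs stay NA
  v[idx]
}

# Made-up wide table with one numeric and one character column.
res <- data.table(time = 1:4,
                  a = c(1, NA, NA, NA),
                  d = c(NA, NA, "a", NA))

# Update all non-key columns by reference, keeping their types.
cols <- setdiff(names(res), "time")
res[, (cols) := lapply(.SD, locf_vec), .SDcols = cols]
print(res)
```

Because the helper only indexes the vector, it works for both numeric and character columns and leaves leading NA values untouched.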

How to conditionally combine data.frame object in the list in more elegant way?

I have a list of data.frame objects and I want to combine them conditionally: first row-bind the second and third data.frames and drop duplicated rows, then row-bind that result below the first data.frame. I used rbind for this, but my approach is not elegant. How can I make the solution more general, so that it can be used in a dynamic, functional-programming style and still produce the desired output?
reproducible example:
dfList <- list(
DF.1 = data.frame(red=c(1,2,3), blue=c(NA,1,2), green=c(1,1,2)),
DF.2 = data.frame(red=c(2,3,NA), blue=c(1,2,3), green=c(1,2,4)),
DF.3 = data.frame(red=c(2,3,NA,NA), blue=c(1,2,NA,3), green=c(1,2,3,4))
)
dummy way to do it:
rbind(dfList[[1L]], unique(rbind(dfList[[2L]], dfList[[3L]])))
My attempt is clearly not elegant enough for functional programming. How can I do this more elegantly?
desired output :
red blue green
1 1 NA 1
2 2 1 1
3 3 2 2
11 2 1 1
21 3 2 2
31 NA 3 4
6 NA NA 3
How can I improve my solution more elegantly and efficiently ? Thanks in advance
An easy and fast way to do this is data.table::rbindlist.
It would work like this:
library(data.table)
dfList <- list(
DF.1 = data.table(red=c(1,2,3), blue=c(NA,1,2), green=c(1,1,2)),
DF.2 = data.table(red=c(2,3,NA), blue=c(1,2,3), green=c(1,2,4)),
DF.3 = data.table(red=c(2,3,NA,NA), blue=c(1,2,NA,3), green=c(1,2,3,4))
)
# part 1: list element 1
dt_1 <- dfList[[1]]
# part 2: all other list elements (in your case 2 and 3)
dt_2 <- unique(rbindlist(dfList[-1]))
# use rbindlist to bind the rows together
dt_all <- rbindlist(list(dt_1, dt_2))
Comment.
My solution is pretty close to your proposed solution. I think the "ugliness" comes from the fact that detaching the first element and treating it differently while merging the others is an edge case. The best solution would probably be to step back, think about the underlying idea, and solve it with an additional variable in the datasets (i.e., one value for df1 and another for df2 and df3), which I would consider the R way.
Something along this thought would look like this:
myList2 <- list(
DF.1 = data.table(red=c(1,2,3), blue=c(NA,1,2), green=c(1,1,2), var = "df1"),
DF.2 = data.table(red=c(2,3,NA), blue=c(1,2,3), green=c(1,2,4), var = "other"),
DF.3 = data.table(red=c(2,3,NA,NA), blue=c(1,2,NA,3), green=c(1,2,3,4), var = "other")
)
dt <- rbindlist(myList2)
unique(dt)
# red blue green var
# 1: 1 NA 1 df1
# 2: 2 1 1 df1
# 3: 3 2 2 df1
# 4: 2 1 1 other
# 5: 3 2 2 other
# 6: NA 3 4 other
# 7: NA NA 3 other
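The grouping variable does not have to be typed into every table by hand. As a sketch, rbindlist() can record the list names through its idcol argument, and the group label can be derived from that. The column names source and var are choices made for this example.

```r
library(data.table)

dfList <- list(
  DF.1 = data.frame(red = c(1, 2, 3), blue = c(NA, 1, 2), green = c(1, 1, 2)),
  DF.2 = data.frame(red = c(2, 3, NA), blue = c(1, 2, 3), green = c(1, 2, 4)),
  DF.3 = data.frame(red = c(2, 3, NA, NA), blue = c(1, 2, NA, 3), green = c(1, 2, 3, 4))
)

# idcol stores the name of the list element each row came from.
dt <- rbindlist(dfList, idcol = "source")

# Derive the group: the first element versus everything else.
dt[, var := ifelse(source == names(dfList)[1], "df1", "other")]

# Deduplicate within groups, then drop the helper column.
res <- unique(dt, by = c("red", "blue", "green", "var"))
res[, source := NULL]
print(res)
```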
A way of rbinding a list of data.frames with only base R is do.call(rbind, list) (see this question that also presents some alternatives).
If you then want only unique rows, you can follow up with unique:
unique(do.call(rbind, dfList))
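To reproduce the desired output from the question for a list of any length, the base R idea can be wrapped in a small function. This is only a sketch, and the function name bind_first_with_unique_rest is made up here.

```r
dfList <- list(
  DF.1 = data.frame(red = c(1, 2, 3), blue = c(NA, 1, 2), green = c(1, 1, 2)),
  DF.2 = data.frame(red = c(2, 3, NA), blue = c(1, 2, 3), green = c(1, 2, 4)),
  DF.3 = data.frame(red = c(2, 3, NA, NA), blue = c(1, 2, NA, 3), green = c(1, 2, 3, 4))
)

# Hypothetical helper: keep the first element as is, deduplicate the rest.
bind_first_with_unique_rest <- function(lst) {
  rest <- unique(do.call(rbind, lst[-1]))  # all elements except the first
  out <- rbind(lst[[1]], rest)
  rownames(out) <- NULL                    # tidy up the generated row names
  out
}

result <- bind_first_with_unique_rest(dfList)
print(result)
```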

Converting Character Response to "N" over a dataset

To start off, an example dataset:
x <- data.frame(v1=1:5,v2=1:5,v3=1:5,
v4=c("Bob","Green","Curley","Banana","No"),
v5=c("Hello","This question is awful, Mad",NA,"Help","Me"))
I've got a large dataset with a multitude of numeric and character variables (survey data). These responses vary greatly in content and length, and the order of the variables matters as well. I'm trying to find a way to select all of the character variables in my dataset, and then set any responses to the letter "N"/"Another item" (while leaving the NA values intact).
With the help of other users in the community, I'm able to fill all of these character variables with NA or "N", etc. :
x[,sapply(x, is.character)] <- "N"
But, I would really like to be able to retain those NA values present within the data - Something like this (I'm not very proficient with the apply functions just yet) :
x[ #Contains ANY Text# ,sapply(x, is.character)] <- "NA"
I haven't found anything that will let me match any and all text within a row/column. As far as I know, something like grep only works with specific character strings. I'm also unsure whether my formatting of the aforementioned function is correct, so please let me know if I'm making an error in where I place my #Contains ANY Text# argument.
Thanks in advance All!
A data.frame is a list so its columns can be changed using lapply.
Here we can subset x to the character columns, and then lapply over them replacing non-NA values with whatever we want.
x <- data.frame(v1=1:5,v2=1:5,v3=1:5,
v4=c("Bob","Green","Curley","Banana","No"),
v5=c("Hello","This question is awful, Mad",NA,"Help","Me"),
stringsAsFactors = FALSE) # your original data.frame had factors
x
# v1 v2 v3 v4 v5
# 1 1 1 1 Bob Hello
# 2 2 2 2 Green This question is awful, Mad
# 3 3 3 3 Curley <NA>
# 4 4 4 4 Banana Help
# 5 5 5 5 No Me
is_char_col <- sapply(x, is.character)
is_char_col
# v1 v2 v3 v4 v5
# FALSE FALSE FALSE TRUE TRUE
Use replace:
x[is_char_col] <- lapply(x[is_char_col], function(k) replace(k, !is.na(k), "N"))
x
# v1 v2 v3 v4 v5
# 1 1 1 1 N N
# 2 2 2 2 N N
# 3 3 3 3 N <NA>
# 4 4 4 4 N N
# 5 5 5 5 N N
If the replacement logic is actually more complicated, you could modify the anonymous function inside lapply.
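As an illustration of more complicated replacement logic, the sketch below maps "No" to "N", every other non-missing response to "Another item", and leaves NA untouched. The rule itself and the function name recode_text are invented for the example.

```r
# Small made-up data frame with one character column.
x <- data.frame(v1 = 1:3,
                v4 = c("Bob", "No", NA),
                stringsAsFactors = FALSE)

is_char_col <- sapply(x, is.character)

# Hypothetical recoding rule: NA stays NA, "No" becomes "N",
# anything else becomes "Another item".
recode_text <- function(k) {
  ifelse(is.na(k), NA_character_,
         ifelse(k == "No", "N", "Another item"))
}

x[is_char_col] <- lapply(x[is_char_col], recode_text)
print(x)
```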
Here is a method using a generic function as mentioned by @effel.
x <- data.frame(v1=1:5,v2=1:5,v3=1:5,
v4=c("Bob","Green","Curley","Banana","No"),
v5=c("Hello","This question is awful, Mad",NA,"Help","Me"),
stringsAsFactors = FALSE)
x <- data.frame(lapply(x, function(i) if (is.character(i)) ifelse(!is.na(i), "N", i) else i),
stringsAsFactors = FALSE) # keep the character columns as character in R < 4.0

Reshape data frame from wide to long with re-occuring column names in R

I'm trying to convert a data frame from wide to long format using the melt function. The challenge is that I have multiple column names that are labeled the same. When I use the melt function, it drops the values from the repeated column. I have read similar questions where it was advised to use the reshape function, but I was not able to get it to work.
To reproduce my starting data frame:
conversion.id<-c("1", "2", "3")
interaction.num<-c("1","1","1")
interaction.num2<-c("2","2","2")
conversion.id<-as.data.frame(conversion.id)
interaction.num<-as.data.frame(interaction.num)
interaction.num2<-as.data.frame(interaction.num2)
conversion<-c(rep("1",3))
conversion<-as.data.frame(conversion)
df<-cbind(conversion.id,interaction.num, interaction.num2, conversion)
names(df)[3]<-"interaction.num"
The data frame looks like the following:
When I run the following melt function:
melt.df<-melt(df,id="conversion.id")
It drops the interaction.num == 2 column and looks something like this:
The data frame I want is the following:
I saw the following post, but I'm not too familiar with the reshape function and wasn't able to get it to work.
How to reshape a dataframe with "reoccurring" columns?
And to add a layer of complexity, I'm looking for a method that is efficient. I need to perform this on a data frame that has around 1M rows with many columns labeled the same.
Any advice would be greatly appreciated!
Here is a solution using tidyr instead of reshape2. One of the advantages is the gather_ function, which takes character vectors as inputs. So, first we can replace all the "problematic" variable names with unique names (by adding numbers to the end of each name) and then we can gather (the equivalent of melt) these specific variables. The unique names of the variables are stored in a temporary variable called "prob_var_name", which I removed at the end.
library(tidyr)
library(dplyr)
library(magrittr) # provides equals(), which dplyr does not re-export
var_name <- "interaction.num"
problem_var <- df %>%
names %>%
equals(var_name) %>%
which
replaced_names <- mapply(paste0,names(df)[problem_var],seq_along(problem_var))
names(df)[problem_var] <- replaced_names
df %>%
gather_("prob_var_name",var_name,replaced_names) %>%
select(-prob_var_name)
conversion.id conversion interaction.num
1 1 1 1
2 2 1 1
3 3 1 1
4 1 1 2
5 2 1 2
6 3 1 2
Thanks to the quoting ability of gather_, you could wrap all this into a function and set var_name to a variable. Then maybe you could use it on all of your duplicated variables?
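The idea of wrapping the steps into a reusable function can also be sketched in base R, without gather_. The function name stack_duplicates is hypothetical. It matches the column name exactly, so a column such as interaction.num2 would not be picked up by accident.

```r
# Rebuild the question's data with two columns named "interaction.num".
df <- data.frame(conversion.id = c("1", "2", "3"),
                 interaction.num = c("1", "1", "1"),
                 interaction.num2 = c("2", "2", "2"),
                 conversion = c("1", "1", "1"),
                 stringsAsFactors = FALSE)
names(df)[3] <- "interaction.num"

# Hypothetical helper: stack all columns called var_name into one column.
stack_duplicates <- function(data, var_name) {
  idx <- which(names(data) == var_name)  # positions of the repeated columns
  rest <- data[-idx]                     # columns that are kept as they are
  pieces <- lapply(idx, function(i) cbind(rest, setNames(data[i], var_name)))
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

long <- stack_duplicates(df, "interaction.num")
print(long)
```

Calling the function once per duplicated name would cover several repeated variables, although the results would then still need to be combined in whatever way fits the analysis.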
Here's a solution using data.table. You just have to provide the index instead of names.
require(data.table)
require(reshape2)
ans <- melt(setDT(df), measure=2:3,
value.name="interaction.num")[, variable := NULL]
# conversion.id conversion interaction.num
# 1: 1 1 1
# 2: 2 1 1
# 3: 3 1 1
# 4: 1 1 2
# 5: 2 1 2
# 6: 3 1 2
You can get the indices 2:3 by doing grep("interaction.num", names(df)).
Here's an approach in base R that should work for you:
x <- grep("interaction.num", names(df)) ## as suggested by Arun
## Make more friendly names for reshape
names(df)[x] <- paste(names(df)[x], seq_along(x), sep = "_")
## Reshape
reshape(df, direction = "long",
idvar=c("conversion.id", "conversion"),
varying = x, sep = "_")
# conversion.id conversion time interaction.num
# 1.1.1 1 1 1 1
# 2.1.1 2 1 1 1
# 3.1.1 3 1 1 1
# 1.1.2 1 1 2 2
# 2.1.2 2 1 2 2
# 3.1.2 3 1 2 2
Another possibility is stack instead of reshape:
x <- grep("interaction.num", names(df)) ## as suggested by Arun
cbind(df[-x], stack(lapply(df[x], as.character)))
The lapply(df[x], as.character) may not be necessary depending on if your values are actually numeric or not. The way you created this sample data, they were factors.
