Converting Character Response to "N" over a dataset - r

To start off, an example dataset:
x <- data.frame(v1=1:5,v2=1:5,v3=1:5,
v4=c("Bob","Green","Curley","Banana","No"),
v5=c("Hello","This question is awful, Mad",NA,"Help","Me"))
I've got a large dataset with a multitude of numeric and character variables (survey data). These responses vary greatly in content and length, and the order of the variables matters as well. I'm trying to find a way to select all of the character variables in my dataset, and then set any responses to the letter "N"/"Another item" (while leaving the NA values intact).
With the help of other users in the community, I'm able to fill all of these character variables with NA or "N", etc. :
x[,sapply(x, is.character)] <- "N"
But, I would really like to be able to retain those NA values present within the data - Something like this (I'm not very proficient with the apply functions just yet) :
x[ #Contains ANY Text# ,sapply(x, is.character)] <- "NA"
I haven't found anything that will allow me to find any and all text within a row/column. It appears something like grep only works with specific character strings, to my knowledge. I'm also unsure whether my formatting of the aforementioned call is correct, so please let me know if I'm making an error in placing my #Contains ANY text# argument.
Thanks in advance All!

A data.frame is a list so its columns can be changed using lapply.
Here we can subset x to the character columns, and then lapply over them replacing non-NA values with whatever we want.
x <- data.frame(v1=1:5,v2=1:5,v3=1:5,
v4=c("Bob","Green","Curley","Banana","No"),
v5=c("Hello","This question is awful, Mad",NA,"Help","Me"),
stringsAsFactors = FALSE) # your original data.frame had factors
x
# v1 v2 v3 v4 v5
# 1 1 1 1 Bob Hello
# 2 2 2 2 Green This question is awful, Mad
# 3 3 3 3 Curley <NA>
# 4 4 4 4 Banana Help
# 5 5 5 5 No Me
is_char_col <- sapply(x, is.character)
is_char_col
# v1 v2 v3 v4 v5
# FALSE FALSE FALSE TRUE TRUE
Use replace:
x[is_char_col] <- lapply(x[is_char_col], function(k) replace(k, !is.na(k), "N"))
x
# v1 v2 v3 v4 v5
# 1 1 1 1 N N
# 2 2 2 2 N N
# 3 3 3 3 N <NA>
# 4 4 4 4 N N
# 5 5 5 5 N N
If the replacement logic is actually more complicated, you could modify the anonymous function inside lapply.
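As a minimal sketch of "more complicated" replacement logic, the anonymous function can be swapped for a named helper. The rule used here (the answer "No" becomes "N", any other text becomes "Another item") and the name recode_text are assumptions made up for illustration, not part of the original question.

```r
x <- data.frame(v1 = 1:5,
                v4 = c("Bob", "Green", "Curley", "Banana", "No"),
                v5 = c("Hello", "This question is awful, Mad", NA, "Help", "Me"),
                stringsAsFactors = FALSE)

# Hypothetical rule: "No" -> "N", any other non-missing text -> "Another item".
recode_text <- function(k) {
  out <- ifelse(k == "No", "N", "Another item")
  out[is.na(k)] <- NA  # ifelse() already propagates NA; kept explicit for clarity
  out
}

is_char_col <- sapply(x, is.character)
x[is_char_col] <- lapply(x[is_char_col], recode_text)
x
```

Numeric columns are untouched because only the columns flagged by is_char_col are passed to the helper.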

Here is a method using a generic function as mentioned by @effel.
x <- data.frame(v1=1:5,v2=1:5,v3=1:5,
v4=c("Bob","Green","Curley","Banana","No"),
v5=c("Hello","This question is awful, Mad",NA,"Help","Me"),
stringsAsFactors = FALSE)
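# stringsAsFactors = FALSE keeps the rebuilt columns as character in R versions before 4.0
x <- data.frame(lapply(x, function(i) if(is.character(i)) ifelse(!is.na(i), "N", i) else i), stringsAsFactors = FALSE)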

Related

grepl across multiple columns in R

I have the following data which has n.a. values (which R does not recognise)
I am trying to remove these values using grepl
x <- x[!grepl("n.a.", x$Fixed.assets.EUR.Last.avail..yr),]
but I am trying to apply it across all columns instead of specifying each column name and having many lines of text.
What I currently have is
x <- sapply(x[, c(1:4)], !grepl("n.a."))
which produces errors and does not work.
Error in match.fun(FUN) :
'!grepl("n.a.", x[, 1:4])' is not a function, character or symbol
Data
dput(x)[1:6, ]
Fixed.assets.EUR.Last.avail..yr Fixed.assets.EUR.Year...1 Fixed.assets.EUR.Year...2
1 34,827,809 38,549,311 29,035,369
2 755,256 658,200 573,888
3 2,639,824 2,739,205 3,230,890
4 2,543,367 2,317,132 2,994,769
5 1,608,004 1,702,838 1,763,244
6 661,875 661,082 584,166
Fixed.assets.EUR.Year...3
1 30,416,099
2 n.a.
3 2,841,046
4 693,370
5 2,024,666
6 565,007
Let me start by saying that the best practice here would be to specify a na.strings = c("n.a.") argument when you read in your data. That said, this is a way to use grepl() to remove any row where you have n.a. as a string.
# keep rows with no "n.a." in columns 1-4; logical indexing also behaves when no row matches
x[!apply(x[,1:4],1,function(y) any(grepl("n.a.",y, fixed=TRUE))),]
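The na.strings suggestion can be sketched as follows. The CSV text and the column names a and b are made up for illustration; with a real file you would pass the file name instead of text =.

```r
# Made-up CSV content: quoted numbers with thousands separators and one "n.a." entry.
csv_text <- "a,b\n\"34,827,809\",\"30,416,099\"\n\"755,256\",n.a.\n"

# na.strings tells the reader to treat "n.a." as a missing value on input.
dat <- read.csv(text = csv_text, na.strings = "n.a.", stringsAsFactors = FALSE)

# The columns are still character because of the commas; strip them to get numbers.
dat[] <- lapply(dat, function(col) as.numeric(gsub(",", "", col)))
dat
```

Once the values are real NAs, the usual tools such as complete.cases() or na.omit() apply without any string matching.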
If you want R to recognize "n.a." as NA values without removing the entire row (and hence losing real values across a row with an n.a. value in only one column), you can use this:
df[df=="n.a."] <- NA
Otherwise, you are better off using @Mako212's solution.
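A small sketch of the in-place conversion, using a made-up three-row data frame: every row survives, and only the "n.a." cells become missing.

```r
# Made-up data: one "n.a." in each column, on different rows.
df <- data.frame(a = c("1", "n.a.", "3"),
                 b = c("x", "y", "n.a."),
                 stringsAsFactors = FALSE)

df[df == "n.a."] <- NA   # the logical matrix df == "n.a." selects the cells to blank out

n_rows_kept <- nrow(df)        # no rows were dropped
na_per_col  <- colSums(is.na(df))
a_numeric   <- as.numeric(df$a) # real values in column a are still available
```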
Here are 2 alternative options
Example Data
set.seed(1)
df <- as.data.frame(matrix(sample(c("n.a.", "good"), 20, replace=TRUE), ncol=2, byrow=TRUE))
head(df)
# V1 V2
# 1 n.a. n.a.
# 2 good good
# 3 n.a. good
# 4 good good
# 5 good n.a.
# 6 n.a. n.a.
Convert n.a. to NA, then use complete.cases
data <- replace(df, df == "n.a.", NA)
data[complete.cases(data),]
# V1 V2
# 2 good good
# 4 good good
# 9 good good
Use rowSums
df[rowSums(df == "n.a.") == 0,]
# V1 V2
# 2 good good
# 4 good good
# 9 good good

How do I check if subgroups of a character column in R are different?

I have some columns of characters such as:
V1 V2 group
B C 1
B C 1
B C 1
A C 2
A A 2
A A 2
in a data frame (call it df) in R which are also grouped by a factor with 2 levels 1 and 2, and I wanted to use
'by' or 'lapply' to see if I could work out which column(s) had a corresponding group structure which is given by group. In this case, the answer would be column V1.
I was thinking something like
by(df, df$group,...)
but wasn't quite sure how to implement this. I've also seen the 'identical' function but didn't know if the opposite was available?
Thanks for any advice!
Maybe:
sapply(df[,1:2], function(x) all(as.numeric(factor(x,
levels=unique(x)))==df$group))
# V1 V2
#TRUE FALSE
Or for this example
!colSums((df[,1:2]=='A')+1!=df$group)
# V1 V2
#TRUE FALSE
Or you could use
!rowSums(aggregate(.~ group, df, FUN=function(x) length(unique(x)))[,-1]!=1)
#[1] TRUE FALSE
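A more verbose sketch of the same check may be easier to adapt. It assumes "matches the group structure" means that a column is constant within each group and takes a different value in each group; the helper name matches_group is made up for illustration.

```r
df <- data.frame(V1 = c("B", "B", "B", "A", "A", "A"),
                 V2 = c("C", "C", "C", "C", "A", "A"),
                 group = c(1, 1, 1, 2, 2, 2),
                 stringsAsFactors = FALSE)

# Hypothetical helper: TRUE when `col` carries exactly the grouping in `group`.
matches_group <- function(col, group) {
  # number of distinct values of `col` inside each group
  per_group <- tapply(col, group, function(v) length(unique(v)))
  constant_within <- all(per_group == 1)
  # as many distinct values overall as there are groups
  distinct_across <- length(unique(col)) == length(unique(group))
  constant_within && distinct_across
}

res <- sapply(df[c("V1", "V2")], matches_group, group = df$group)
res
```

Unlike the factor-level comparison, this version does not depend on the order in which values first appear or on the group labels being 1 and 2.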

Why as.data.frame doing this in R programming?

First of all, I would like to say that I am new to R programming. I was experimenting with some R code and ran into some strange behaviour that I did not expect. I hope someone can help me figure it out.
I ran the following code to read data from a CSV file:
normData= read.csv("normData.csv");
and my normData looks like:
But when I ran the following code to form a data frame:
datExpr0 = as.data.frame(t(normData));
I get the following data:
Can someone please tell me where the extra row (v1,v2,v3,v4,v5,v6) is coming from?
Try using:
setNames(as.data.frame(t(normData[-1])), normData[[1]])
However, it might be better to see if you can use the row.names argument in read.table (or read.csv) to read your "X" column directly as the row names. Then you should be able to use as.data.frame(t(...)) directly.
Here's a small example to show what's happening:
Start with a data.frame with characters as the first column:
df <- data.frame(A = letters[1:3],
B = 1:3, C = 4:6)
df
# A B C
# 1 a 1 4
# 2 b 2 5
# 3 c 3 6
When you transpose the entire thing, you also transpose that first column (thereby also creating a character matrix).
as.data.frame(t(df))
# V1 V2 V3
# A a b c
# B 1 2 3
# C 4 5 6
So, we drop the column first, and use the values from the column to replace the "V1", "V2"... names.
setNames(as.data.frame(t(df[-1])), df[[1]])
# a b c
# B 1 2 3
# C 4 5 6
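The row.names route can be sketched like this. The CSV content (an "X" identifier column followed by two numeric columns) is made up, since the original file is not shown; with a real file you would pass its name instead of text =.

```r
# Made-up stand-in for normData.csv: first column "X" holds the identifiers.
csv_text <- "X,s1,s2\ngeneA,1,4\ngeneB,2,5\ngeneC,3,6\n"

# row.names = 1 uses the first column as row names instead of as data.
normData <- read.csv(text = csv_text, row.names = 1)

# Only numeric columns remain, so the transpose stays numeric
# and the identifiers become the column names.
datExpr0 <- as.data.frame(t(normData))
datExpr0
```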

R: Properly using a dataframe as an argument to a function

I am practicing using the apply function in R, and so I'm writing a simple function to apply to a dataframe.
I have a dataframe with 2 columns.
V1 V2
1 3
2 4
I decided to do some basic arithmetic and have the answer in the 3rd column, specifically, I want to multiply the first column by 2 and the second column by 3, then sum them.
V1 V2 V3
1 3 11
2 4 16
Here's what I was thinking:
mydf <- as.data.frame(matrix(c(1:4),ncol=2,nrow=2))
some_function <- function(some_df) {some_df[,1]*2 +
some_df[,2]*3}
mydf <- apply(mydf ,2, some_function)
But what is wrong with my arguments to the function? R is giving me an error regarding the dimension of the dataframe. Why?
Three things wrong:
1) apply passes each row or column to the function as a plain vector, so index it with [1], not [,1]
2) you need to run by row (MARGIN = 1), not by column (MARGIN = 2)
3) you need to cbind the result, because apply doesn't append a column; assigning its result to mydf overwrites the data frame
mydf <- as.data.frame(matrix(c(1:4),ncol=2,nrow=2))
some_function <- function(some_df) {some_df[1]*2 +
some_df[2]*3}
mydf <- cbind(mydf,V3=apply(mydf ,1, some_function))
# V1 V2 V3
#1 1 3 11
#2 2 4 16
but probably easier just to do the vector math:
mydf$V3<-mydf[,1]*2 + mydf[,2]*3
because vector math is one of the greatest things about R
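To see where the dimension error comes from, the sketch below inspects what apply hands to the function, then repeats the vector math in a form that generalises to any set of weights. The weights vector is a name introduced here for illustration.

```r
mydf <- as.data.frame(matrix(1:4, ncol = 2, nrow = 2))

# apply() passes a plain vector with no dim attribute, so two-index
# subsetting such as v[, 1] cannot work inside the function.
n_dims <- apply(mydf, 2, function(v) length(dim(v)))
err <- tryCatch(apply(mydf, 2, function(v) v[, 1]),
                error = function(e) conditionMessage(e))

# Vectorised weighted sum: a matrix product does V1 * 2 + V2 * 3 for all rows at once.
weights <- c(2, 3)
mydf$V3 <- as.vector(as.matrix(mydf[, 1:2]) %*% weights)
mydf
```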

Extract data elements found in a single column

Here is what my data look like.
id interest_string
1 YI{Z0{ZI{
2 ZO{
3 <NA>
4 ZT{
As you can see, there can be multiple codes concatenated into a single column, separated by {. It is also possible for a row to have no interest_string value at all.
How can I manipulate this data frame to extract the values into a format like this:
id interest
1 YI
1 Z0
1 ZI
2 ZO
3 <NA>
4 ZT
I need to complete this task with R.
Thanks in advance.
This is one solution
out <- with(dat, strsplit(as.character(interest_string), "\\{"))
## or
# out <- with(dat, strsplit(as.character(interest_string), "{", fixed = TRUE))
out <- cbind.data.frame(id = rep(dat$id, times = sapply(out, length)),
interest = unlist(out, use.names = FALSE))
Giving:
R> out
id interest
1 1 YI
2 1 Z0
3 1 ZI
4 2 ZO
5 3 <NA>
6 4 ZT
Explanation
The first line of solution simply splits each element of the interest_string factor in data object dat, using \\{ as the split indicator. This indicator has to be escaped and in R that requires two \. (Actually it doesn't if you use fixed = TRUE in the call to strsplit.) The resulting object is a list, which looks like this for the example data
R> out
[[1]]
[1] "YI" "Z0" "ZI"
[[2]]
[1] "ZO"
[[3]]
[1] "<NA>"
[[4]]
[1] "ZT"
We have almost everything we need in this list to form the output you require. The only thing we need external to this list is the id values that refer to each element of out, which we grab from the original data.
Hence, in the second line, we bind, column-wise (specifying the data frame method so we get a data frame returned) the original id values, each one repeated the required number of times, to the strsplit list (out). By unlisting this list, we unwrap it to a vector which is of the required length as given by your expected output. We get the number of times we need to replicate each id value from the lengths of the components of the list returned by strsplit.
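A self-contained sketch of the two steps, assuming interest_string is a character column in which the missing entry is a real NA (rather than the literal text "<NA>"). It uses lengths() in place of sapply(out, length) and a separate name, parts, for the intermediate list.

```r
dat <- data.frame(id = 1:4,
                  interest_string = c("YI{Z0{ZI{", "ZO{", NA, "ZT{"),
                  stringsAsFactors = FALSE)

# Step 1: split each string on the literal "{"; an NA input yields a single NA.
parts <- strsplit(dat$interest_string, "{", fixed = TRUE)

# Step 2: repeat each id once per extracted code and flatten the list.
out <- data.frame(id = rep(dat$id, times = lengths(parts)),
                  interest = unlist(parts, use.names = FALSE),
                  stringsAsFactors = FALSE)
out
```

The trailing "{" in each string does not produce an empty code, because strsplit drops the empty string after the final separator.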
A nice and tidy data.table solution:
library(data.table)
DT <- data.table( read.table( textConnection("id interest_string
1 YI{Z0{ZI{
2 ZO{
3 <NA>
4 ZT{"), header=TRUE))
DT$interest_string <- as.character(DT$interest_string)
DT[, {
list(interest=unlist(strsplit( interest_string, "{", fixed=TRUE )))
}, by=id]
gives me
id interest
1: 1 YI
2: 1 Z0
3: 1 ZI
4: 2 ZO
5: 3 <NA>
6: 4 ZT
