replace asterisks in dataframe with NA's - r

Here's my dataframe df:
df=data.frame(rbind(c(1,"*","*"),c("*",3,"*")))
I'm trying:
df2=as.data.frame(sapply(df,sub,pattern="*",replacement="NA"))
It doesn't work because of the asterisk but I'm getting mad trying to replace it.

If your data.frame only contains bare * values (that is, nothing like ab*de), then you can do this without regex:
df[df == "*"] <- NA
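To make the difference between an exact match and a substring match concrete, here is a minimal sketch on a made-up vector x (not the asker's data). Comparing with == only touches cells that are exactly "*", while fixed = TRUE tells grepl to treat the asterisk as a literal character rather than a regex quantifier.

```r
# Hypothetical example vector; "ab*de" contains an asterisk but is not a bare "*"
x <- c("1", "*", "ab*de")

# Exact comparison: only the bare "*" becomes NA
x_exact <- replace(x, x == "*", NA)

# Literal substring search: fixed = TRUE switches off regex interpretation
has_star <- grepl("*", x, fixed = TRUE)

print(x_exact)
print(has_star)
```

If values like ab*de should also be caught, the has_star index is the one to use for the replacement.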

Both solutions here address an object already in your workspace. If possible (or at least in the future) you can make use of the na.strings argument in read.table. Notice that it is plural "strings", so you can specify more than one string to treat as NA values.
Here's an example: This just writes a file named "readmein.txt" to your current working directory and verifies that it is there.
cat("V1 V2 V3 V4 V5 V6 V7\n
2 * * * * * 2\n
1 2 * * * * 1\n", file = "readmein.txt")
list.files(pattern = "readme")
# [1] "readmein.txt"
Here's read.table with the na.strings argument in action.
read.table("readmein.txt", na.strings="*", header = TRUE)
# V1 V2 V3 V4 V5 V6 V7
# 1 2 NA NA NA NA NA 2
# 2 1 2 NA NA NA NA 1
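Since na.strings accepts a vector, a small sketch with two markers might look like the following. The inline text and the column names are made up for illustration, and text = is used instead of a file so that nothing is written to disk.

```r
# Hypothetical input where both "*" and "." stand for missing values
txt <- "V1 V2 V3
1 * a
. 2 b"

# Pass every missing-value marker in one character vector
dat <- read.table(text = txt, header = TRUE, na.strings = c("*", "."))

str(dat)
```

Because the markers are turned into NA while the data is being read, V1 and V2 come out as integer columns rather than character or factor.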
Update: Objects already in your workspace
I see another problem with the other two answers: They both result in character (or rather factor) variables, even when the column should have possibly been numeric.
Here's an example. First, we create an example dataset. For fun, I've added another character to be treated as NA: ".".
temp <- data.frame(
V1 = c(1:3),
V2 = c(1, "*", 3),
V3 = c("a", "*", "c"),
V4 = c(".", "*", "3"))
temp
# V1 V2 V3 V4
# 1 1 1 a .
# 2 2 * * *
# 3 3 3 c 3
str(temp)
# 'data.frame': 3 obs. of 4 variables:
# $ V1: int 1 2 3
# $ V2: Factor w/ 3 levels "*","1","3": 2 1 3
# $ V3: Factor w/ 3 levels "*","a","c": 2 1 3
# $ V4: Factor w/ 3 levels ".","*","3": 1 2 3
Let's make a copy, and then solve this in what I would consider the most obvious "R" way:
temp1 <- temp
temp1[temp1 == "*"|temp1 == "."] <- NA
Looks OK...
temp1
# V1 V2 V3 V4
# 1 1 1 a <NA>
# 2 2 <NA> <NA> <NA>
# 3 3 3 c 3
... but I presume that V2 and V4 should have been numeric....
str(temp1)
# 'data.frame': 3 obs. of 4 variables:
# $ V1: int 1 2 3
# $ V2: Factor w/ 3 levels "*","1","3": 2 NA 3
# $ V3: Factor w/ 3 levels "*","a","c": 2 NA 3
# $ V4: Factor w/ 3 levels ".","*","3": 1 NA 3
Here's a workaround:
temp2 <- read.table(text = capture.output(temp), na.strings = c("*", "."))
temp2
# V1 V2 V3 V4
# 1 1 1 a NA
# 2 2 NA <NA> NA
# 3 3 3 c 3
str(temp2)
# 'data.frame': 3 obs. of 4 variables:
# $ V1: int 1 2 3
# $ V2: int 1 NA 3
# $ V3: Factor w/ 2 levels "a","c": 1 NA 2
# $ V4: int NA NA 3
Update 2: (Yet another) alternative
It might be more appropriate to make use of type.convert which is described as a "helper function for read.table" on its help page. I haven't timed it, but my guess is that it would be faster than the workaround I mentioned above, with all the benefits.
data.frame(
lapply(temp, function(x) type.convert(
as.character(x), na.strings = c("*", "."))))
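As a variation on the type.convert idea, here is a sketch that keeps the result in a data.frame via clean[] <- lapply(...) and adds as.is = TRUE so that text columns stay character instead of becoming factors. The object name clean and the reduced example data are assumptions for illustration.

```r
# Reduced version of the example data, built without factors
temp <- data.frame(V1 = 1:3,
                   V2 = c(1, "*", 3),
                   V4 = c(".", "*", "3"),
                   stringsAsFactors = FALSE)

# Re-detect each column's type, treating "*" and "." as NA
clean <- temp
clean[] <- lapply(temp, function(x) {
  type.convert(as.character(x), na.strings = c("*", "."), as.is = TRUE)
})

str(clean)
```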

You should put up a full reproducible example, people will be more inclined to help when you make it easy for em. Anywho...
dat <- data.frame(a=c(1,2,'*',3,4), b=c('*',2,3,4,'*'))
> dat
a b
1 1 *
2 2 2
3 * 3
4 3 4
5 4 *
> as.data.frame(sapply(dat,sub,pattern='\\*',replacement=NA))
a b
1 1 <NA>
2 2 2
3 <NA> 3
4 3 4
5 4 <NA>

This could work (it's pretty flexible), but there are other great solutions already. Arun's solution is my typical approach, but I created replacer for users who are new to R and have little experience with the command line. I wouldn't recommend replacer for anyone with even a bit of experience.
library(qdap)
replacer(dat, "*", NA)

Related

Stop R from converting a character factor to number

I am trying to convert missing factor values to NA in a data frame, and create a new data frame with replaced values but when I try to do that, previously character factors are all converted to numbers. I cannot figure out what I am doing wrong and cannot find a similar question. Could anybody please help?
Here is my code:
orders <- c('One','Two','Three', '')
ids <- c(1, 2, 3, 4)
values <- c(1.5, 100.6, 19.3, '')
df <- data.frame(orders, ids, values)
new.df <- as.data.frame(matrix( , ncol = ncol(df), nrow = 0))
names(new.df) <- names(df)
for(i in 1:nrow(df)){
row.df <- df[i, ]
print(row.df$orders) # "One", "Two", "Three", ""
print(str(row.df$orders)) # Factor
# Want to replace "orders" value in each row with NA if it is missing
row.df$orders <- ifelse(row.df$orders == "", NA, row.df$orders)
print(row.df$orders) # Converted to number
print(str(row.df$orders)) # int or logi
# Add the row with new value to the new data frame
new.df[nrow(new.df) + 1, ] <- row.df
}
and I get this:
> new.df
orders ids values
1 2 1 2
2 4 2 3
3 3 3 4
4 NA 4 1
but I want this:
> new.df
orders ids values
1 One 1 1.5
2 Two 2 100.6
3 Three 3 19.3
4 NA 4
Convert empty values to NA and use type.convert to change their class.
df[df == ''] <- NA
df <- type.convert(df)
df
# orders ids values
#1 One 1 1.5
#2 Two 2 100.6
#3 Three 3 19.3
#4 <NA> 4 NA
str(df)
#'data.frame': 4 obs. of 3 variables:
#$ orders: Factor w/ 4 levels "","One","Three",..: 2 4 3 1
#$ ids : int 1 2 3 4
#$ values: num 1.5 100.6 19.3 NA
Thanks to the hint from Ronak Shah, I did this and it gave me what I wanted.
df$orders[df$orders == ''] <- NA
This will give me:
> df
orders ids values
1 One 1 1.5
2 Two 2 100.6
3 Three 3 19.3
4 <NA> 4
> str(df)
'data.frame': 4 obs. of 3 variables:
$ orders: Factor w/ 4 levels "","One","Three",..: 2 4 3 NA
$ ids : num 1 2 3 4
$ values: Factor w/ 4 levels "","1.5","100.6",..: 2 3 4 1
In case you are curious about the difference between NA and <NA> as I was, you can find the answer here.
Your suggestion
df$orders[is.na(df$orders)] <- NA
did not work, maybe because the missing entry is an empty string rather than NA?
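For anyone wondering where the numbers in the question came from, here is a small sketch of what seems to be going on: ifelse drops the factor class and hands back the underlying integer codes. Converting to character first keeps the labels. The variable names codes and labels are only for illustration.

```r
# Same values as the orders column, stored as a factor
orders <- factor(c("One", "Two", "Three", ""))

# ifelse() returns the integer level codes, not the labels
codes <- ifelse(orders == "", NA, orders)

# Converting to character first preserves the text
labels <- ifelse(orders == "", NA, as.character(orders))

print(codes)
print(labels)
```

The codes follow the alphabetical level order ("", "One", "Three", "Two"), which is why "Two" shows up as 4 and "Three" as 3.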

R transposing numeric data.frame results in character variables

I've got a data.frame which contains a character variable and multiple numeric variables, something like this:
sampleDF <- data.frame(a = c(1,2,3,"String"), b = c(1,2,3,4), c= c(5,6,7,8), stringsAsFactors = FALSE)
Which looks like this:
a b c
1 1 1 5
2 2 2 6
3 3 3 7
4 String 4 8
I'd like to transpose this data.frame and get it to look like this:
V1 V2 V3 V4
1 1 2 3 String
2 1 2 3 4
3 5 6 7 8
I tried
c<-t(sampleDF)
as well as
d<-transpose(sampleDF)
but both of these methods result in V1, V2 and V3 now being of character type despite only holding numeric values.
I know that this has already been asked multiple times. However, I haven't found a suitable answer for why in this case V1, V2 and V3 are also being converted to character.
Is there any way to ensure that these columns stay numeric?
Thanks a lot, and apologies in advance for the duplicate nature of this question.
EDIT:
as.data.frame(t(sampleDF))
Does not solve the problem:
'data.frame': 3 obs. of 4 variables:
$ V1: Factor w/ 2 levels "1","5": 1 1 2
..- attr(*, "names")= chr "a" "b" "c"
$ V2: Factor w/ 2 levels "2","6": 1 1 2
..- attr(*, "names")= chr "a" "b" "c"
$ V3: Factor w/ 2 levels "3","7": 1 1 2
..- attr(*, "names")= chr "a" "b" "c"
$ V4: Factor w/ 3 levels "4","8","String": 3 1 2
..- attr(*, "names")= chr "a" "b" "c"
After transposing it, convert the columns to numeric with type.convert
out <- as.data.frame(t(sampleDF), stringsAsFactors = FALSE)
out[] <- lapply(out, type.convert, as.is = TRUE)
row.names(out) <- NULL
out
# V1 V2 V3 V4
#1 1 2 3 String
#2 1 2 3 4
#3 5 6 7 8
str(out)
#'data.frame': 3 obs. of 4 variables:
# $ V1: int 1 1 5
# $ V2: int 2 2 6
# $ V3: int 3 3 7
# $ V4: chr "String" "4" "8"
Or rbind the first column converted to respective 'types' with the transposed other columns
rbind(lapply(sampleDF[,1], type.convert, as.is = TRUE),
as.data.frame(t(sampleDF[2:3])))
NOTE: The first method would be more efficient
Or another approach would be to paste the values together in each column and then read it again
read.table(text=paste(sapply(sampleDF, paste, collapse=" "),
collapse="\n"), header = FALSE, stringsAsFactors = FALSE)
# V1 V2 V3 V4
#1 1 2 3 String
#2 1 2 3 4
#3 5 6 7 8
Or we can convert the 'data.frame' to 'data.matrix' which changes the character elements to NA, use the is.na to find the index of elements that are NA for replacing with the original string values
m1 <- data.matrix(sampleDF)
out <- as.data.frame(t(m1))
out[is.na(out)] <- sampleDF[is.na(m1)]
Or another option is type_convert from readr
library(dplyr)
library(readr)
sampleDF %>%
t %>%
as_data_frame %>%
type_convert
# A tibble: 3 x 4
# V1 V2 V3 V4
# <int> <int> <int> <chr>
#1 1 2 3 String
#2 1 2 3 4
#3 5 6 7 8
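On the "why" part of the question: t() goes through a matrix, and a matrix can only hold one type, so a single character column forces everything to character. A minimal sketch of that, followed by the type.convert repair from the first method:

```r
sampleDF <- data.frame(a = c(1, 2, 3, "String"),
                       b = c(1, 2, 3, 4),
                       c = c(5, 6, 7, 8),
                       stringsAsFactors = FALSE)

# The transposed object is a character matrix
m <- t(sampleDF)
print(typeof(m))

# Back to a data.frame, then re-detect each column's type
out <- as.data.frame(m, stringsAsFactors = FALSE)
out[] <- lapply(out, type.convert, as.is = TRUE)
str(out)
```

The last column stays character because it still contains "String"; the others become numeric again only after the explicit conversion.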

Use clipboard as vector R (windows)

I'm trying to read my clipboard in R as a vector. I have a large list of numbers I need in vector format and I tried copy and pasting values before, but R stops after 4000 numbers.
# 1,2,3,4,5,6,7,8 <--example of what's on clipboard
vector<-c(1,2,3,4,5,6,7,8)
vector[5]
#[1] 5
Why can't I apply the same thing to the readClipboard function?
vector<-c(readClipboard())
vector
vector[5]
#[1]"1,2,3,4,5,6,7,8"
#[1]NA
Is there any way to get rid of the quotes and use these values?
Use the clipr package.
> # clipboard: 1,2,3,4,5,6,7,8
> library(clipr)
> read_clip_tbl(sep=",", header=FALSE)
V1 V2 V3 V4 V5 V6 V7 V8
1 1 2 3 4 5 6 7 8
This output is a dataframe, but now you just have to take its first row:
tbl <- read_clip_tbl(sep=",", header=FALSE)
unlist(tbl[1,])
This gives a named vector:
V1 V2 V3 V4 V5 V6 V7 V8
1 2 3 4 5 6 7 8
If you don't want the names:
> unname(unlist(tbl[1,]))
[1] 1 2 3 4 5 6 7 8
EDIT
In fact you don't need clipr. You can do:
> read.table(file="clipboard", sep=",")
V1 V2 V3 V4 V5 V6 V7 V8
1 1 2 3 4 5 6 7 8
My use-case is generally working between Excel and R, in which case @StéphaneLaurent's answer works. (In fact, I tend to use read.table(file='clipboard',header=F) (or =T) instead of readClipboard(), as it has worked well for me in almost all situations.)
If your string is non-tabular, though, as in a literal comma-delimited string, you can split it using:
# clipboard: 1,2,3,4,5,6,7,8
s <- strsplit(readClipboard(), ",")
str(s)
# List of 1
# $ : chr [1:8] "1" "2" "3" "4" ...
There are several data-massaging steps you may want/need to do, depending on your use, such as: as.integer(s[[1]]), as.numeric(s[[1]]), or trimws(s[[1]]) (not useful here).
Notice that I have not yet un-listed it. In the event that you have more than one line copied, such as a clipboard with:
1,2,3,4,5
11,12,13,14
then
s <- strsplit(readClipboard(), ",")
str(s)
# List of 2
# $ : chr [1:5] "1" "2" "3" "4" ...
# $ : chr [1:4] "11" "12" "13" "14"
str(lapply(s, as.integer))
# List of 2
# $ : int [1:5] 1 2 3 4 5
# $ : int [1:4] 11 12 13 14
and you can either refer to each line within as (say) s[[1]] for 1-5, or you can (as suggested in other answers) unlist(s) to combine
unlist(lapply(s, as.integer))
# [1] 1 2 3 4 5 11 12 13 14
as.integer(unlist(s))
# [1] 1 2 3 4 5 11 12 13 14
If the number of entries per line is always the same (i.e., CSV), you can read a single row such as
1,2,3,4,5
with
read.csv(file='clipboard', header=FALSE)
# V1 V2 V3 V4 V5
# 1 1 2 3 4 5
And multilines such as
1,2,3,4,5
11,12,13,14,15
into
read.csv(file='clipboard', header=FALSE)
# V1 V2 V3 V4 V5
# 1 1 2 3 4 5
# 2 11 12 13 14 15
etc. From here, it can be unlisted, as.matrixed, or whatever you want, though unlist will do it by-column instead of by-row.
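To try the splitting step without a clipboard (readClipboard() is Windows-only), here is a sketch that uses a plain string, clip, as a stand-in for what readClipboard() would return. Both strsplit and scan(text = ...) get to the same numeric vector.

```r
# Stand-in for the single string returned by readClipboard()
clip <- "1,2,3,4,5,6,7,8"

# Split on literal commas, take the first (only) line, convert to numbers
v <- as.numeric(strsplit(clip, ",", fixed = TRUE)[[1]])

# scan() can parse the same text directly
v2 <- scan(text = clip, sep = ",", quiet = TRUE)

print(v[5])
```

Once the values are numeric, indexing such as v[5] behaves like it does for a vector typed in by hand.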

Data frame list - Adding a new column based on data frame name

I have a list of dataframes, each one with several columns. An example of my data could be:
Ind_ID<-rep(1:15)
Mun<-sample(15)
T_i<-paste0("D",rep(1:5))
data<-cbind(Ind_ID,Mun,T_i)
data<-data.frame(data)
mylist<-split(data,data$T_i)
str(mylist)
List of 5
$ D1:'data.frame': 3 obs. of 3 variables:
..$ Ind_ID: Factor w/ 15 levels "1","10","11",..: 1 12 3
..$ Mun : Factor w/ 15 levels "1","10","11",..: 3 10 7
..$ T_i : Factor w/ 5 levels "D1","D2","D3",..: 1 1 1
$ D2:'data.frame': 3 obs. of 3 variables:
..$ Ind_ID: Factor w/ 15 levels "1","10","11",..: 8 13 4
..$ Mun : Factor w/ 15 levels "1","10","11",..: 14 11 5
..$ T_i : Factor w/ 5 levels "D1","D2","D3",..: 2 2 2
...
$ D5:'data.frame': 3 obs. of 3 variables:
..$ Ind_ID: Factor w/ 15 levels "1","10","11",..: 11 2 7
..$ Mun : Factor w/ 15 levels "1","10","11",..: 4 12 2
..$ T_i : Factor w/ 5 levels "D1","D2","D3",..: 5 5 5
I want to add a new column with the same name as the data frame. My expected output is:
$D1
Ind_ID Mun T_i D1
1 1 11 D1 NA
6 6 4 D1 NA
11 11 15 D1 NA
$D2
Ind_ID Mun T_i D2
2 2 8 D2 NA
7 7 5 D2 NA
12 12 13 D2 NA
....
$D5
Ind_ID Mun T_i D5
5 5 12 D5 NA
10 10 6 D5 NA
15 15 10 D5 NA
My failed attempts include:
nam<-as.list(names(mylist))
fun01 <- function(x,y){cbind(x, y = rep(1, nrow(x)))}
a1<-lapply(mylist, fun01,nam)
str(a1) # This generates a new column with the name "y" in all cases
fun02 <- function(x,y){x= cbind(x, a = rep(1, nrow(x)));names(x)[4] <- y}
a2<-lapply(mylist, fun02,nam)
str(a2) # It changes the data frames
Any help? Thanks in advance
You can loop through all the dataframes with a lapply call and create your new column with something like this:
newlist = lapply(1:length(mylist), function(i){
# Get the dataframe and the name
tmp_df = mylist[[i]]
tmp_name = names(mylist)[i]
# Create a new column with all NAs
tmp_df[,ncol(tmp_df) + 1] = NA
# Rename the newly created column
colnames(tmp_df)[ncol(tmp_df)] = tmp_name
# Return the df
return(tmp_df)
})
Option 1: You could use Map(). First we can write a little function for the iteration.
f <- function(df, nm) cbind(df, setNames(data.frame(NA), nm))
Map(f, mylist, names(mylist))
Option 2: You could live dangerously and do
Map("[<-", mylist, names(mylist), value = NA)
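A slightly more explicit take on the Map() idea, sketched on a tiny made-up list so it runs on its own. The anonymous function adds a column named after the list element, and Map keeps the list names, so the result can still be indexed as newlist$D1.

```r
# Small hypothetical list standing in for mylist
mylist <- list(D1 = data.frame(Ind_ID = 1:2),
               D2 = data.frame(Ind_ID = 3:4))

# Walk over the data frames and their names in parallel
newlist <- Map(function(df, nm) {
  df[[nm]] <- NA   # new column named after the list element
  df
}, mylist, names(mylist))

str(newlist)
```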

Keep Empty Factor Levels During Aggregation [duplicate]

I'm using data.table to aggregate values, but I'm finding that when the "by" variable has a level not present in the aggregation, it is omitted, even if it is specified in the factor levels.
The code below generates a data.table with 6 rows, the last two of which only have one of the two possible levels for f2 nested within f1. During aggregation, the {3,1} combination is dropped.
library(data.table)
set.seed(1987)
dt <- data.table(f1 = factor(rep(1:3, each = 2)),
f2 = factor(sample(1:2, 6, replace = TRUE)),
val = runif(6))
str(dt)
Classes ‘data.table’ and 'data.frame': 6 obs. of 3 variables:
$ f1 : Factor w/ 3 levels "1","2","3": 1 1 2 2 3 3
$ f2 : Factor w/ 2 levels "1","2": 1 2 2 1 2 2
$ val: num 0.383 0.233 0.597 0.346 0.606 ...
- attr(*, ".internal.selfref")=<externalptr>
dt
f1 f2 val
1: 1 1 0.3829077
2: 1 2 0.2327311
3: 2 2 0.5965087
4: 2 1 0.3456710
5: 3 2 0.6058819
6: 3 2 0.7437177
dt[, sum(val), by = list(f1, f2)] # output is missing a row
f1 f2 V1
1: 1 1 0.3829077
2: 1 2 0.2327311
3: 2 2 0.5965087
4: 2 1 0.3456710
5: 3 2 1.3495996
# this is the output I'm looking for:
f1 f2 V1
1: 1 1 0.3829077
2: 1 2 0.2327311
3: 2 2 0.5965087
4: 2 1 0.3456710
5: 3 1 0.0000000 # <- the missing row from above
6: 3 2 1.3495996
Is there a way to achieve this behavior?
Why do you expect data.table to compute sums for all combinations of f1 and f2?
If this is what you want, you should add the missing rows to the original data before computing the grouped sum. For example:
setkey(dt, f1, f2)
# omit "by = .EACHI" in data.table <= 1.9.2
dt[CJ(levels(f1), levels(f2)), sum(val, na.rm=T),
allow.cartesian = T, by = .EACHI]
## f1 f2 V1
## 1: 1 1 0.3829077
## 2: 1 2 0.2327311
## 3: 2 1 0.3456710
## 4: 2 2 0.5965087
## 5: 3 1 0.0000000 ## <- your "missing row" :)
## 6: 3 2 1.3495996
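For comparison only, and not a data.table solution: base R's xtabs fills in every combination of factor levels, so empty cells come out as 0. This is a different technique from the CJ join above, sketched on made-up values so the sums are easy to check by hand.

```r
# Hypothetical data: the (f1 = 3, f2 = 1) combination never occurs
d <- data.frame(f1  = factor(c(1, 1, 2, 2, 3, 3)),
                f2  = factor(c(1, 2, 2, 1, 2, 2)),
                val = c(0.5, 0.25, 1, 2, 3, 4))

# Sum val over the full grid of levels; absent combinations get 0
sums <- as.data.frame(xtabs(val ~ f1 + f2, data = d), responseName = "V1")

print(sums)
```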
