R: replace NA with item from vector

I am trying to replace some missing values in my data with the average values from a similar group.
My data looks like this:
X Y
1 x y
2 x y
3 NA y
4 x y
And I want it to look like this:
X Y
1 x y
2 x y
3 y y
4 x y
I wrote this, and it worked
for(i in 1:nrow(data.frame)){
if( is.na(data.frame$X[i]) == TRUE){
data.frame$X[i] <- data.frame$Y[i]
}
}
But my data.frame is almost half a million lines long, and the for/if statements are pretty slow. What I want is something like
is.na(data.frame$X) <- data.frame$Y
But this gets a mismatched size error. It seems like there should be a command that does this, but I cannot find it here on SO or on the R help list. Any ideas?

ifelse is your friend.
Using Dirk's dataset
df <- within(df, X <- ifelse(is.na(X), Y, X))

Just vectorise it -- the boolean index test is one expression, and you can use that in the assignment too.
Setting up the data:
R> df <- data.frame(X=c("x", "x", NA, "x"), Y=rep("y",4), stringsAsFactors=FALSE)
R> df
X Y
1 x y
2 x y
3 <NA> y
4 x y
And then proceed by computing an index of where to replace, and replace:
R> ind <- which( is.na( df$X ) )
R> df[ind, "X"] <- df[ind, "Y"]
which yields the desired outcome:
R> df
X Y
1 x y
2 x y
3 y y
4 x y
R>
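The same replacement can also be written without which(), by using the logical vector itself as the index. This is a minimal sketch that rebuilds the df from above; the name na_rows is only illustrative.

```r
df <- data.frame(X = c("x", "x", NA, "x"), Y = rep("y", 4),
                 stringsAsFactors = FALSE)

# TRUE where X is missing
na_rows <- is.na(df$X)

# Both sides are subset with the same index, so the lengths always match
df$X[na_rows] <- df$Y[na_rows]
df
```

It also hints at why is.na(data.frame$X) <- data.frame$Y from the question cannot do the job: the replacement form of is.na() marks the indexed positions as NA rather than filling them in.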

If you are already using dplyr or the tidyverse, you can use the coalesce function to do exactly this.
> library(dplyr)
> df <- data.frame(X=c("x", "x", NA, "x"), Y=rep("y",4), stringsAsFactors=FALSE)
> df %>% mutate(X = coalesce(X, Y))
X Y
1 x y
2 x y
3 y y
4 x y

Unfortunately I cannot comment yet, but when I vectorized some code that involved strings (character data), the approaches above did not seem to work. The reason is explained in this answer. When characters are involved, stringsAsFactors=FALSE is not enough, because R might already have created factors out of the characters. One needs to ensure that the data becomes a character vector again, e.g., data.frame(X=as.character(c("x", "x", NA, "x")), Y=as.character(rep("y",4)), stringsAsFactors=FALSE)
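If the data frame already exists and its string columns have become factors, one option is to convert them back before replacing. The sketch below assumes a small df built with stringsAsFactors = TRUE to simulate that situation; is_fac and na_rows are illustrative names.

```r
# Simulate the problem: character data that has already become factors
df <- data.frame(X = c("x", "x", NA, "x"), Y = rep("y", 4),
                 stringsAsFactors = TRUE)

# Convert every factor column back to character, leaving other columns alone
is_fac <- vapply(df, is.factor, logical(1))
df[is_fac] <- lapply(df[is_fac], as.character)

# Now the vectorised replacement behaves as expected
na_rows <- is.na(df$X)
df$X[na_rows] <- df$Y[na_rows]
df
```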

Related

selecting columns from a set of names with dplyr

I'm attempting to make subsets of a large data frame based on whether the column names are in an externally defined set. So I'm starting with something like:
> x <- c(1,2,3)
> y <- c("a","b","c")
> z <- c(4,5,6)
>
> df <- data.frame(x=x,y=y,z=z)
> df
x y z
1 1 a 4
2 2 b 5
3 3 c 6
chosen_columns <- c(x,y)
And I'm attempting to use this much to end up with:
x y
1 1 a
2 2 b
3 3 c
It seems like using select() from dplyr should be able to handle this perfectly, but I'm not sure what the arguments would be to get that. Something like:
df_chosen <- df %>%
select(is.element(___,chosen_columns))
I'm just not sure what would go in the ___ there.
Thank you!
c(x, y) is not a vector of two columns: it's combining your objects x and y into a vector of characters: c("1", "2", "3", "a","b","c").
You may want to create a vector of column names and then pass it directly to select():
library(dplyr)
chosen_columns <- c("x", "y")
df |> select(all_of(chosen_columns))
(Thank you, Gregor Thomas, for the advice to wrap column names in all_of()).
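For comparison, the same selection can be done in base R by indexing with the character vector of names. This is only a sketch; the vector wanted and its deliberately missing name are made up for illustration.

```r
df <- data.frame(x = c(1, 2, 3), y = c("a", "b", "c"), z = c(4, 5, 6))
chosen_columns <- c("x", "y")

# Indexing with a character vector of names returns a data frame
df_chosen <- df[chosen_columns]

# If some requested names might be absent, keep only those that exist
wanted <- c("x", "y", "not_a_column")
df_existing <- df[intersect(wanted, names(df))]
```

In dplyr, any_of() plays the role of the second form: it selects the listed names that exist and silently skips the rest, whereas all_of() raises an error for a missing name.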

How to get the sum of the product of selected column in a data frame?

This is probably very simple but I couldn't think of a solution.
I have the following data frame, and I want to multiply column y with column z and sum the answer.
> df <- data.frame(x = c(1,2,3), y = c(2,4,6), z = c(2,3,4))
> df
x y z
1 1 2 2
2 2 4 3
3 3 6 4
The value found should be equal to 40.
with would be an option here if we don't want to repeat df$ or df[[ to extract the column
with(df, sum( y * z))
#[1] 40
Or %*%
c(df$y %*% df$z)
Additionally, you could use data.table. The second argument inside the brackets (after the first comma) is j, which operates on columns. You don't need the spaces; they're just there to show how it works.
library(data.table)
a <- data.table(x = c(1,2,3), y = c(2,4,6), z = c(2,3,4))
#dt i j by
a[ , sum(y*z), ]
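As a quick check that the base R approaches agree, the sketch below computes the sum of products in two ways on the data frame from the question; total_sum and total_cross are illustrative names.

```r
df <- data.frame(x = c(1, 2, 3), y = c(2, 4, 6), z = c(2, 3, 4))

# Element-wise product, then sum: 2*2 + 4*3 + 6*4
total_sum <- sum(df$y * df$z)

# Same quantity as an inner product; crossprod() returns a 1x1 matrix,
# so drop() reduces it to a plain number
total_cross <- drop(crossprod(df$y, df$z))

total_sum    # 40
```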

r: subsetting with square brackets not working

I made data frame called x:
a b
1 2
3 NA
3 32
21 7
12 8
When I run
y <- x["a">2,]
The object y returned is identical to x. If I run
y <- x["a" == 1,]
y is an empty frame.
I made sure that the names of the x data frame have no white spaces (I named them myself with names()) and also that a and b are numeric.
PS: If I try
y <- x["a">2]
y is also returned as identical to x.
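You're making an error in referencing the column of your data.frame x.
"a" > 2 compares the character string "a" with 2, not the column a of x. R coerces 2 to the string "2", the comparison evaluates to a single TRUE, and that value is recycled to select every row; "a" == 1 likewise evaluates to a single FALSE, which selects no rows. To reference the column, use either x$a or x["a"].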
try
y <- x[x$a >2 ,]
or
y <- x[x["a"] >2 ,]
or even more clear
ix <- x["a"] > 2
y <- x[ix,]
A simple alternative would be using data.table
library(data.table)
setDT(x)
y <- x[ a > 2, ]
y <- x[ a == 1, ]
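To make the difference concrete, here is a small sketch using the data from the question. The subset() call on column b is an addition for illustration, showing one way to avoid the all-NA row that plain bracket indexing returns when the condition itself is NA.

```r
x <- data.frame(a = c(1, 3, 3, 21, 12), b = c(2, NA, 32, 7, 8))

# The literal string "a" is compared with the number, not the column:
# the result is a single logical value that is recycled over all rows
"a" == 1          # FALSE, hence the empty data frame

# Referencing the column gives one logical value per row
y <- x[x$a > 2, ]

# subset() evaluates the condition inside the data frame and
# drops rows where the condition is NA
y_b <- subset(x, b > 5)
```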

R subset vector when treated as strings

I have a large data frame where I've forced my vectors into strings (using lapply and toString) so they fit into a data frame, and now I can't check whether one column is a subset of the other. Is there a simple way to do this?
X <- data.frame(y=c("ABC","A"), z=c("ABC","A,B,C"))
X
y z
1 ABC ABC
2 A A,B,C
all(X$y %in% X$z)
[1] FALSE
(X$y[1] %in% X$z[1])
[1] TRUE
(X$y[2] %in% X$z[2])
[1] FALSE
I need to treat each y and z string value as a vector (comma separated) again and then check if y is a subset of z.
In the above case, A is a subset of A,B,C. However, because I've treated both as strings, it doesn't work.
In the example above, y holds a single value per row, while z holds one value in the first row and three in the second. The data frame sample I'll be testing has 10,000 rows; y will have 1-5 values per row and z 1-100 per row. It looks like the 1-5 values are always a subset of z, but I'd like to check.
df = data.frame(y=c("ABC","A"), z=c("ABC","A,B,C"))
apply(df, 1, function(x) { # perform rowise ops.
y = unlist(strsplit(x[1], ",")) # splitting X$y if incase it had ","
z = y %in% unlist(strsplit(x[2], ",")) # check how many of 'X$y' present in 'X$z'
if (sum(z) == length(y)) # if all present then return TRUE
return(TRUE)
else
return(FALSE)
})
#[1] TRUE TRUE
# Case 2: changed the data. You will have to define if you want perfect subset or not. Accordingly we can update the code
df = data.frame(y=c("ABC","A,B,D"), z=c("ABC","A,B,C"))
#[1] TRUE FALSE
I think it might work better for you not to use your lapply and toString combination, but store the lists in your data frame. For this purpose, I find the tbl_df (as found in the tibble package) more friendly, although I believe data.table objects can do this as well (someone correct me if I'm wrong)
library(tibble)
y_char <- list("ABC", "A")
z_char <- list("ABC", c("A", "B", "C"))
X <- tibble(y = y_char,
z = z_char)
Notice that when you print X now, your entries in each row of the tibble are entries from the list. Now we can use mapply to do pairwise comparison.
# All y in z
mapply(function(x, y) all(x %in% y),
X$y,
X$z)
# All z in y
mapply(function(x, y) all(y %in% x),
X$y,
X$z)
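If the columns are already stored as strings, the same pairwise check can be run after splitting them again. This is a sketch under the assumption that the strings were built with toString(), which joins elements with a comma followed by a space, so the split pattern allows optional whitespace. The third row is made up to show a failing case.

```r
X <- data.frame(y = c("ABC", "A", "A, D"),
                z = c("ABC", "A, B, C", "A, B, C"),
                stringsAsFactors = FALSE)

# Split on a comma plus any following whitespace
y_parts <- strsplit(X$y, ",\\s*")
z_parts <- strsplit(X$z, ",\\s*")

# For each row, TRUE only if every element of y appears in z
is_subset <- mapply(function(y, z) all(y %in% z), y_parts, z_parts)
is_subset
```

all(is_subset) then answers whether y is a subset of z in every row, and which(!is_subset) lists the rows where it is not.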

Drop columns when splitting data frame in R

I am trying to split a data table by a column, but the resulting list of data tables still contains the column the table was split by. How can I drop this column once the split is complete? Or, preferably, is there a way to drop multiple columns?
This is my code:
library(data.table)
x <- rnorm(10, mean = 5, sd = 2)
y <- rnorm(10, mean = 5, sd = 2)
z <- sample(5, 10, replace = TRUE)
dt <- data.table(x, y, z)
split(dt, dt$z)
The resulting data table subsets look like this:
$`1`
x y z
1: 6.179790 5.776683 1
2: 5.725441 4.896294 1
3: 8.690388 5.394973 1
$`2`
x y z
1: 5.768285 3.951733 2
2: 4.572454 5.487236 2
$`3`
x y z
1: 5.183101 8.328322 3
2: 2.830511 3.526044 3
$`4`
x y z
1: 5.043010 5.566391 4
2: 5.744546 2.780889 4
$`5`
x y z
1: 6.771102 0.09301977 5
Thanks
Splitting a data.table is really not worthwhile unless you have some fancy parallelization step to follow. And even then, you might be better off sticking with a single table.
That said, I think you want
split( dt[, !"z"], dt$z )
# or more generally
mysplitDT <- function(x, bycols)
split( x[, !..bycols], x[, ..bycols] )
mysplitDT(dt, "z")
You would run into the same problem if you had a data.frame:
df = data.frame(dt)
split( df[-which(names(df)=="z")], df$z )
First thing that came to mind was to iterate through the list and drop the z column.
lapply(split(dt, dt$z), function(d) { d$z <- NULL; d })
And I just noticed that you use the data.table package, so there is probably a better, data.table way of achieving your desired result.
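One candidate for that data.table way is the package's own split() method, which takes by and keep.by arguments. The sketch below assumes a reasonably recent data.table version; the seed and the names parts and parts_x_only are only there for illustration.

```r
library(data.table)

set.seed(1)  # only to make the illustration reproducible
dt <- data.table(x = rnorm(10, mean = 5, sd = 2),
                 y = rnorm(10, mean = 5, sd = 2),
                 z = sample(5, 10, replace = TRUE))

# Split by z and leave the grouping column out of each piece
parts <- split(dt, by = "z", keep.by = FALSE)

# To drop further columns as well, remove them before splitting
parts_x_only <- split(dt[, !"y"], by = "z", keep.by = FALSE)
```

Passing several column names to by splits on their combination, and keep.by = FALSE then omits all of them from the pieces.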
