I made a data frame called x:
a b
1 2
3 NA
3 32
21 7
12 8
When I run
y <- x["a">2,]
The object y returned is identical to x. If I run
y <- x["a" == 1,]
y is an empty frame.
I made sure that the names of the x data frame have no white space (I named them myself with names()) and also that a and b are numeric.
PS: If I try
y <- x["a">2]
y is also returned as identical to x.
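You're making an error in referencing the column of your data.frame x.
"a">2 compares the character string "a" with the number 2 (which is coerced to the string "2"), not the column a of data.frame x. The result is a single logical value, which R recycles across every row: a single TRUE returns all of x, and the single FALSE from "a" == 1 returns an empty frame. You need to use either x$a or x["a"] to reference your data.frame column.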
try
y <- x[x$a >2 ,]
or
y <- x[x["a"] >2 ,]
or, even clearer,
ix <- x["a"] > 2
y <- x[ix,]
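To see why the original calls behaved as they did, here is a minimal sketch that rebuilds x from the question and inspects the index expressions directly. The way x is constructed is my assumption about how the data was entered, and the comparison result assumes a typical locale in which letters sort after digits.

```r
# Rebuild the example data from the question (assumed construction)
x <- data.frame(a = c(1, 3, 3, 21, 12),
                b = c(2, NA, 32, 7, 8))

# The string "a" is compared with 2 (coerced to "2"): one logical value
"a" > 2    # a single TRUE in typical locales (letters sort after digits)
"a" == 1   # a single FALSE

# A single TRUE is recycled over all rows, a single FALSE selects none
all_rows <- x["a" > 2, ]
no_rows  <- x["a" == 1, ]

# Referencing the column gives one logical value per row
y <- x[x$a > 2, ]
y
```

Note that if column a contained NA, x[x$a > 2, ] would return an all-NA row for each missing value; wrapping the condition in which() avoids that.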
A simple alternative would be using data.table
library(data.table)
setDT(x)
y <- x[ a > 2, ]
y <- x[ a == 1, ]
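If adding a package is not an option, base R's subset() also lets you refer to columns by bare name. A small sketch, again assuming x is built as in the question; note that subset() drops rows where the condition evaluates to NA.

```r
# Example data from the question (assumed construction)
x <- data.frame(a = c(1, 3, 3, 21, 12),
                b = c(2, NA, 32, 7, 8))

y_gt2 <- subset(x, a > 2)   # rows where a is greater than 2
y_eq1 <- subset(x, a == 1)  # rows where a equals 1
```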
I'm attempting to make subsets of a large data frame based on whether the column names are in an externally defined set. So I'm starting with something like:
> x <- c(1,2,3)
> y <- c("a","b","c")
> z <- c(4,5,6)
>
> df <- data.frame(x=x,y=y,z=z)
> df
x y z
1 1 a 4
2 2 b 5
3 3 c 6
chosen_columns <- c(x,y)
And I'm attempting to use this much to end up with:
x y
1 1 a
2 2 b
3 3 c
It seems like using select() from dplyr should be able to handle this perfectly, but I'm not sure what the arguments would be to get that. Something like:
df_chosen <- df %>%
select(is.element(___,chosen_columns))
I'm just not sure what would go in the ___ there.
Thank you!
c(x, y) is not a vector of two columns: it's combining your objects x and y into a vector of characters: c("1", "2", "3", "a","b","c").
You may want to create a vector of column names and then pass it directly to select():
library(dplyr)
chosen_columns <- c("x", "y")
df |> select(all_of(chosen_columns))
(Thank you, Gregor Thomas, for the advice to wrap column names in all_of()).
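The same idea works without dplyr. The sketch below uses the question's data to show what c(x, y) actually produces, then subsets by a character vector of column names in base R. The name "w" is a hypothetical entry added only to illustrate an external set containing a name that is not in df.

```r
x <- c(1, 2, 3)
y <- c("a", "b", "c")
z <- c(4, 5, 6)
df <- data.frame(x = x, y = y, z = z)

# Combining the objects themselves coerces everything to character
combined <- c(x, y)   # "1" "2" "3" "a" "b" "c"

# Column *names* as strings are what the subsetting needs
chosen_columns <- c("x", "y")
df_chosen <- df[chosen_columns]

# If the external set may contain names that are not in df,
# keep only those that exist ("w" is a hypothetical missing name)
wanted  <- c("x", "y", "w")
df_safe <- df[intersect(names(df), wanted)]
```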
Let's take an example data frame from which a column chosen at run time should be removed:
frame <- data.frame("a" = 1:5, "b" = 2:6, "c" = 3:7, "d" = 4:8)
rem <- readline()
frame <- subset(frame, select = -c(rem))
How do I get the variable column to be removed? This is not my real code, just wanted to present my problem in a simple code. Thanks!
Edit: I am so sorry, I am really sleepy and don't know what I typed into my code, I edited it now.
1) Do both at once. We assume that ix contains at least one column number.
ix <- 1:2
frame[-ix]
## c d
## 1 3 4
## 2 4 5
## 3 5 6
## 4 6 7
## 5 7 8
1a) or, if the case where ix is zero length (ix <- c()) is important, we can do this instead. The output of this and all the remaining approaches is the same as for (1), so we won't repeat it.
ix <- 1:2
frame[setdiff(seq_along(frame), ix)]
1b) or if we have names rather than column numbers. This works even if nms is a zero length vector in which case it returns the original data frame.
nms <- c("a", "b")
frame[setdiff(names(frame), nms)]
2) or, if you need to do it iteratively, remove the largest column number first. If it were done in ascending order, then after the first column is removed the second column would no longer be the second but the first. If we knew that ix is already sorted we could omit the sort. We have used frame_out to hold the result so that the input is not destroyed. This works even if ix is the empty vector.
ix <- 1:2
frame_out <- frame
for(i in rev(sort(ix))) frame_out <- frame_out[-i]
frame_out
3) One way to do it independent of order is to do it by name. In this case it would be possible to remove them in ascending order. This works even if ix is the empty vector.
ix <- 1:2
nms <- names(frame)[ix]
frame_out <- frame
for(nm in nms) frame_out <- frame_out[-match(nm, names(frame_out))]
frame_out
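For the case in the question, where the column to drop arrives as a string from readline(), approach (1b) applies directly. A minimal sketch, with rem hard-coded as a stand-in for the interactive input (the value "a" is an assumption for illustration):

```r
frame <- data.frame(a = 1:5, b = 2:6, c = 3:7, d = 4:8)

# Stand-in for rem <- readline(); assume the user typed "a"
rem <- "a"

# Keep every column whose name is not in rem
frame_out <- frame[setdiff(names(frame), rem)]
names(frame_out)
```

Because setdiff() works on names, a value of rem that matches no column simply leaves the data frame unchanged.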
I have a large data frame where I've forced my vectors into strings (using lapply and toString) so they fit into a data frame, and now I can't check whether one column is a subset of the other. Is there a simple way to do this?
X <- data.frame(y=c("ABC","A"), z=c("ABC","A,B,C"))
X
y z
1 ABC ABC
2 A A,B,C
all(X$y %in% X$z)
[1] FALSE
(X$y[1] %in% X$z[1])
[1] TRUE
(X$y[2] %in% X$z[2])
[1] FALSE
I need to treat each y and z string value as a vector (comma separated) again and then check if y is a subset of z.
In the above case, A is a subset of A,B,C. However, because I've treated both as strings, it doesn't work.
In the above, y holds just one value per row, while z holds one value in the first row and three in the second. The data frame I'll be testing has 10,000 rows, with 1-5 values per row in y and 1-100 per row in z. It looks like the 1-5 values are always a subset of z, but I'd like to check.
df = data.frame(y=c("ABC","A"), z=c("ABC","A,B,C"))
apply(df, 1, function(x) { # perform row-wise operations
  y = unlist(strsplit(x[1], ",")) # split df$y in case it contains ","
  z = y %in% unlist(strsplit(x[2], ",")) # which elements of df$y are present in df$z
  all(z) # TRUE only if every element of y is present in z
})
# [1] TRUE TRUE
# Case 2: changed the data. You will have to decide whether you want a perfect subset or not; the code can be updated accordingly.
df = data.frame(y=c("ABC","A,B,D"), z=c("ABC","A,B,C"))
# [1] TRUE FALSE
I think it might work better for you not to use your lapply and toString combination, but to store the lists in your data frame instead. For this purpose, I find the tibble (from the tibble package) friendlier, although I believe data.table objects can hold list columns as well (someone correct me if I'm wrong).
library(tibble)
y_char <- list("ABC", "A")
z_char <- list("ABC", c("A", "B", "C"))
# tibble() replaces the deprecated data_frame()
X <- tibble(y = y_char,
            z = z_char)
Notice that when you print X now, your entries in each row of the tibble are entries from the list. Now we can use mapply to do pairwise comparison.
# All y in z
mapply(function(x, y) all(x %in% y),
X$y,
X$z)
# All z in y
mapply(function(x, y) all(y %in% x),
X$y,
X$z)
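If the columns already hold comma-separated strings, as in the question, they can be split back into lists before the pairwise check. A sketch in base R; the split pattern ", *" is my assumption, chosen so that it handles both "A,B,C" as shown in the question and the "A, B, C" form that toString() produces by default.

```r
X <- data.frame(y = c("ABC", "A"), z = c("ABC", "A,B,C"))

# Split each string on commas (optionally followed by spaces);
# as.character() guards against factor columns in older R versions
y_list <- strsplit(as.character(X$y), ", *")
z_list <- strsplit(as.character(X$z), ", *")

# Row by row: is every element of y present in z?
y_in_z <- mapply(function(y, z) all(y %in% z), y_list, z_list)
y_in_z
```

all(y_in_z) then answers the overall question of whether y is a subset of z in every row.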
I have a data frame that looks something like this:
x y
1 a
1 b
1 c
1 NA
1 NA
2 d
2 e
2 NA
2 NA
My desired output is a data frame that shows the number of complete cases of Y (that is, the non-NA values) for each X. So supposing Y has 2500 complete observations for X = 1 and 557 observations for X = 2, I should get this simple data frame:
x y(c.cases)
1 2500
2 557
Currently my function works, but only for a single X. When I pass a range for X (e.g. 30:25), I get the total over all the specified Xs instead of a separate count of complete observations for each X. This is an outline of my function:
complete <- function(){
files <- file.list()
dat<- c() #Creates an empty vector
Y <- c() #Empty vector that will list down the Ys
result <- c()
for(i in c(X)){
dat <- rbind(dat, read.csv(files[i]))
}
dat_subset_Y <- dat[which(dat[, 'X'] %in% x), ]
Y <- c(Y, sum(complete.cases(dat)))
result <- cbind(X, Y)
print(result)
}
There are no errors or warning messages; the results are simply wrong when X is a range.
We can use data.table. We convert the 'data.frame' to a 'data.table' (setDT(df1)) and, grouped by 'x', count the non-NA elements (sum(!is.na(y))).
library(data.table)
setDT(df1)[, list(y=sum(!is.na(y))), by = x]
Or another option is table
with(df1, table(x, !is.na(y)))
No need for that loop.
library(dplyr)
df %>%
filter(complete.cases(.))%>%
group_by(x) %>%
summarise(sumy=length(y))
Or
df %>%
group_by(x) %>%
summarise(sumy=sum(!is.na(y)))
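The same per-group count can be sketched in base R with tapply(). The data frame df1 below is reconstructed from the small example in the question; the column names x and y and the output layout are assumptions for illustration.

```r
# Example data reconstructed from the question
df1 <- data.frame(x = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
                  y = c("a", "b", "c", NA, NA, "d", "e", NA, NA))

# Count the non-NA values of y within each value of x
counts <- tapply(!is.na(df1$y), df1$x, sum)

# Arrange the counts as a two-column data frame
result <- data.frame(x = as.numeric(names(counts)),
                     y = as.vector(counts))
result
```

Unlike the filter(complete.cases(.)) variant, this keeps a group with a count of 0 when all of its y values are NA.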
I am trying to replace some missing values in my data with the average values from a similar group.
My data looks like this:
X Y
1 x y
2 x y
3 NA y
4 x y
And I want it to look like this:
X Y
1 x y
2 x y
3 y y
4 x y
I wrote this, and it worked
for(i in 1:nrow(data.frame)){
if( is.na(data.frame$X[i]) == TRUE){
data.frame$X[i] <- data.frame$Y[i]
}
}
But my data.frame is almost half a million lines long, and the for/if statements are pretty slow. What I want is something like
is.na(data.frame$X) <- data.frame$Y
But this gets a mismatched size error. It seems like there should be a command that does this, but I cannot find it here on SO or on the R help list. Any ideas?
ifelse is your friend.
Using Dirk's dataset
df <- within(df, X <- ifelse(is.na(X), Y, X))
Just vectorise it -- the boolean index test is one expression, and you can use that in the assignment too.
Setting up the data:
R> df <- data.frame(X=c("x", "x", NA, "x"), Y=rep("y",4), stringsAsFactors=FALSE)
R> df
X Y
1 x y
2 x y
3 <NA> y
4 x y
And then proceed by computing an index of where to replace, and replace:
R> ind <- which( is.na( df$X ) )
R> df[ind, "X"] <- df[ind, "Y"]
which yields the desired outcome:
R> df
X Y
1 x y
2 x y
3 y y
4 x y
R>
If you are already using dplyr or tidyverse, you can use the coalesce function to do exactly this.
> df <- data.frame(X=c("x", "x", NA, "x"), Y=rep("y",4), stringsAsFactors=FALSE)
> df %>% mutate(X = coalesce(X, Y))
X Y
1 x y
2 x y
3 y y
4 x y
Unfortunately I cannot comment yet, but while vectorizing some code where strings (characters) were involved, the above seemed not to work. The reason is explained in this answer. If characters are involved, stringsAsFactors=FALSE is not enough, because R might already have created factors out of the characters. One needs to ensure that the data becomes a character vector again, e.g., data.frame(X=as.character(c("x", "x", NA, "x")), Y=as.character(rep("y",4)), stringsAsFactors=FALSE)
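A small sketch of the factor problem described above, as far as I can reproduce it: when the columns are factors, ifelse() returns the underlying integer codes rather than the labels. Here stringsAsFactors = TRUE is set deliberately to force factor columns on any R version, and the name bad is only for illustration.

```r
# Force factor columns to reproduce the problem on any R version
df <- data.frame(X = c("x", "x", NA, "x"), Y = rep("y", 4),
                 stringsAsFactors = TRUE)

# On factors, ifelse() works with the underlying integer codes
bad <- ifelse(is.na(df$X), df$Y, df$X)

# Converting to character first gives the intended result
df$X <- as.character(df$X)
df$Y <- as.character(df$Y)
df$X <- ifelse(is.na(df$X), df$Y, df$X)
df
```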