Losing data by using the function unlist - r

I have a simple but strange problem.
indices.list is a list containing 118,771 elements (integers or numerics). After applying unlist, I lose about 500 elements.
Look at the following code:
> indices <- unlist(indices.list, use.names = FALSE)
>
> length(indices.list)
[1] 118771
> length(indices)
[1] 118248
How is that possible? I checked whether indices.list contains any NA, but it does not:
> any(is.na(indices.list) == TRUE)
[1] FALSE
data.set.merged is a data frame containing more than 200,000 rows. When I use the vector indices (which apparently has length 118,248) to take a subset of data.set.merged, I get a data frame with 118,771 rows!? That's so strange!
data.set.merged.2 <- data.set.merged[indices, ]
> nrow(data.set.merged.2)
[1] 118771
Any ideas what's going on here?

Well, for your first mystery, the likely explanation is that some elements of indices.list are NULL, which means they will disappear when you use unlist:
unlist(list(a = 1, b = "test", c = 2, d = NULL, e = 5))
a b c e
"1" "test" "2" "5"

Related

Why does the function work after doing fix() in R

Here's what had happened:
> NA.of.df = which(rowSums(is.na(df)) == ncol(df))
> NA.of.df
named integer(0)
> fix(df) # i want to see what's in here -- nothing wrong
> NA.of.df # so i run it again
1 3 5 7 9 # it works!
Why would this happen?
A reproducible example (though it doesn't look like any particular data structure via dput()) is the following:
> dput(NA.of.df)
structure(integer(0), .Names = character(0))
and NA.of.df is just the code for finding rows with all NAs (obtained from here:
Remove rows in R matrix where all data is NA). (i.e. NA.of.df = which(rowSums(is.na(df)) == ncol(df)))
It could be an issue with quotes around the NA, causing is.na to miss those elements:
is.na(c(NA, "NA"))
#[1] TRUE FALSE
After doing the fix, the quotes may have been dropped, so the expression then evaluates correctly.
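To illustrate, here is a made-up df where the missing values were read in as the literal string "NA" rather than real NA:

```r
# Hypothetical df: the "missing" cells hold the string "NA", not real NA
df <- data.frame(a = c("NA", "1", "3"), b = c("NA", "2", "4"),
                 stringsAsFactors = FALSE)
which(rowSums(is.na(df)) == ncol(df))   # integer(0): nothing is seen as NA
df[df == "NA"] <- NA                    # convert the strings to real NA
which(rowSums(is.na(df)) == ncol(df))   # now row 1 is found
```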

Getting last character/number of data frame column

I'm trying to get the last character or number of a series of symbols in a data frame, so I can filter some categories afterwards. But I'm not getting the expected result.
names = as.character(c("ABC Co","DEF Co","XYZ Co"))
code = as.character(c("ABCN1","DEFMO2","XYZIOIP4")) #variable length
my_df = as.data.frame(cbind(names,code))
First Approach:
my_df[,3] = substr(my_df[,2],length(my_df[,2]),length(my_df[,2]))
What I expected to receive was: c("1","2","4")
What I am really receiving is : c("C","F","Z")
Then, I realized that length(my_df[,2]) is the number of rows of my data frame, and not the length of each cell. So, I decided to create this loop:
for (i in length(nrow(my_df))){
my_df[i,3] = substr(my_df[i,2],length(my_df[i,2]),length(my_df[i,2]))
}
What I expected to receive was: c("1","2","4")
What I am really receiving is : c("A","F","Z")
So then I tried:
for (i in length(nrow(my_df))){
my_df[i,3] = substr(my_df[i,2],-1,-1)
}
What I expected to receive was: c("1","2","4")
What I am really receiving is : c("","F","Z")
Not getting any luck, any thoughts of what am I missing? Thank you very much!
length is a property of the vector (or list) as a whole, whereas substr needs the number of characters in each string. Base R's nchar gives exactly that.
my_df = as.data.frame(cbind(names, code), stringsAsFactors = FALSE)
substr(my_df[,2], nchar(my_df[,2]), nchar(my_df[,2]))
# [1] "1" "2" "4"
(I added stringsAsFactors = FALSE, otherwise you'll need to add as.character.)
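The difference in a nutshell: length counts elements, nchar counts characters per element.

```r
code <- c("ABCN1", "DEFMO2", "XYZIOIP4")
length(code)                            # 3 -- elements in the vector
nchar(code)                             # 5 6 8 -- characters in each string
substr(code, nchar(code), nchar(code))  # "1" "2" "4"
```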
If the last character is always a number you can do:
library(stringr)
str_extract(my_df$code, "\\d$")
[1] "1" "2" "4"
If the last character can be anything you can do this:
str_extract(my_df$code, ".$")
You can use substr:
my_df$last_char <- substr(code, nchar(code), nchar(code))
# or my_df$last_char <- substr(my_df$code, nchar(my_df$code), nchar(my_df$code))
Output
my_df
# names code last_char
# 1 ABC Co ABCN1 1
# 2 DEF Co DEFMO2 2
# 3 XYZ Co XYZIOIP4 4
We can use sub
sub(".*(\\d+$)", "\\1", my_df$code)

Replacing items in one list from another w/matching names

I feel like I've forgotten something very obvious here...
Let's say we have two lists, a and b, with differing lengths:
a <- list(me = "you1", they = "our1", our = "till1", grow = "NOPE1")
b <- list(me = "my2", their = "his2", our = "aft2", new = "noise2",
they = "now2", b_names = "thurs2")
We want to replace the items in a with corresponding items from b, if an item in b has the same name as an item in a.
Manually, essentially this would equate to replacing: me, our, they in list a from those items in list b.
For the life of me, the only approach I'm coming up with is using Reduce rather than match or %chin% etc. to find the intersection of names, and then always using the last list object as the lookup table. I suppose you don't really need Reduce, since intersect would work fine on its own... but regardless...
Isn't there a simpler, more straight forward way that I am simply forgetting?
Here's my code.. it works..but that's not the point.
reduce.names <- function(...) {
  vars <- list(...)
  if (length(vars) > 2) {
    return("only 2 lists allowed...")
  } else {
    Reduce(intersect, Map(names, vars))
  }
}
> matched_names <- reduce.names(a,b)
> matched_names
[1] "me" "they" "our"
a[matched_names] <- b[matched_names]
> a
$me
[1] "my2"
$they
[1] "now2"
$our
[1] "aft2"
$grow
[1] "NOPE1"
here's another approach that works... but just seems redundant and sketchy...
> merge(a,b) %>% .[names(a)]
$me
[1] "my2"
$they
[1] "now2"
$our
[1] "aft2"
$grow
[1] "NOPE1"
Any advice/alternate approach/reminder of some base function I have completely forgotten would be greatly appreciated. Thanks.
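For what it's worth, the same replacement can be written with intersect directly, no helper function needed; a sketch on the two lists above:

```r
a <- list(me = "you1", they = "our1", our = "till1", grow = "NOPE1")
b <- list(me = "my2", their = "his2", our = "aft2", new = "noise2",
          they = "now2", b_names = "thurs2")
common <- intersect(names(a), names(b))  # "me" "they" "our"
a[common] <- b[common]                   # replace in place; order of a is kept
a$me     # "my2"
a$grow   # "NOPE1" -- untouched
```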

How to delete row n and n+1 in a dataframe?

I have several dataframes in which I want to delete each row that matches a certain string. I used the following code to do it:
df[!(regexpr("abc", df$V4) ==1),]
How can I delete the row that is following, e.g. if I delete row n as specified by the code above, how can I additionally delete row n+1?
My first try was to simply find out the indices of the desired rows, but that won't work, as I need to delete rows in different dataframes which are of different lengths. So the indices vary.
Thanks!
I would suggest extracting and manipulating the logical vector directly. Suppose we have the vector:
x = c(5,0,1, 4, 3)
and we want to do:
x[x > 3]
First, note that:
R> (s_n = x>3)
[1] TRUE FALSE FALSE TRUE FALSE
So
R> (s_n1 = as.logical(s_n + c(F, s_n[1:(length(s_n)-1)])))
[1] TRUE TRUE FALSE TRUE TRUE
Hence,
x[s_n1]
gives you what you want.
In your particular example, the vector to extend is the match indicator (the rows to drop), which you then negate:
drop = regexpr("abc", df$V4) == 1
drop_n1 = as.logical(drop + c(F, drop[1:(length(drop)-1)]))
df[!drop_n1, ]
should work.
Use which() on your logical expression and then you can just add 1 to the result.
sel <- which(regexpr("abc", df$V4) == 1)
sel <- c(sel, sel + 1)
df[-sel, ]
idx <- which(regexpr("abc", df$V4) == 1)
df[-unique(c(idx, idx + 1)), ]
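Putting the which()-based approach to work on a toy data frame (a hypothetical df where the match sits in row 2):

```r
df <- data.frame(V4 = c("keep1", "abc hit", "gets dropped too", "keep2"),
                 stringsAsFactors = FALSE)
idx <- which(regexpr("abc", df$V4) == 1)   # 2: the matching row
df2 <- df[-unique(c(idx, idx + 1)), ]      # drop rows 2 and 3
df2$V4                                     # "keep1" "keep2"
```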

Assigning NULL to a list element in R?

I found this behaviour odd and wanted more experienced users to share their thoughts and workarounds.
On running the code sample below in R:
sampleList <- list()
d<- data.frame(x1 = letters[1:10], x2 = 1:10, stringsAsFactors = FALSE)
for(i in 1:nrow(d)) {
sampleList[[i]] <- d$x1[i]
}
print(sampleList[[1]])
#[1] "a"
print(sampleList[[2]])
#[1] "b"
print(sampleList[[3]])
#[1] "c"
print(length(sampleList))
#[1] 10
sampleList[[2]] <- NULL
print(length(sampleList))
#[1] 9
print(sampleList[[2]])
#[1] "c"
print(sampleList[[3]])
#[1] "d"
The list elements get shifted up.
Maybe this is as expected, but I am trying to implement a function where I merge two elements of a list and drop one. I basically want to lose that list index or have it as NULL.
Is there any way I can assign NULL to it and not see the above behaviour?
Thank you for your suggestions.
Good question.
Check out the R-FAQ:
In R, if x is a list, then x[i] <- NULL and x[[i]] <- NULL remove the specified elements from x. The first of these is incompatible with S, where it is a no-op. (Note that you can set elements to NULL using x[i] <- list(NULL).)
consider the following example:
> t <- list(1,2,3,4)
> t[[3]] <- NULL     # removing the 3rd element (subsequent elements shift up)
> t[2] <- list(NULL) # setting the 2nd element to NULL
> t
[[1]]
[1] 1
[[2]]
NULL
[[3]]
[1] 4
UPDATE:
As the author of The R Inferno commented, there can be more subtle situations when dealing with NULL. Consider a fairly general piece of code:
# x is some list(); now we want to process it
> for (i in 1:n) x[[i]] <- some_function(...)
Be aware that if some_function() returns NULL, you may not get what you want: those elements will simply disappear. You should use the lapply function instead.
Take a look at this toy example:
> initial <- list(1,2,3,4)
> processed_by_for <- list(0,0,0,0)
> processed_by_lapply <- list(0,0,0,0)
> toy_function <- function(x) {if (x%%2==0) return(x) else return(NULL)}
> for (i in 1:4) processed_by_for[[i]] <- toy_function(initial[[i]])
> processed_by_lapply <- lapply(initial, toy_function)
> processed_by_for
[[1]]
[1] 0
[[2]]
[1] 2
[[3]]
NULL
[[4]]
[1] 4
> processed_by_lapply
[[1]]
NULL
[[2]]
[1] 2
[[3]]
NULL
[[4]]
[1] 4
Your question is a bit confusing to me.
Assigning NULL to an existing object essentially deletes that object (this can be very handy, for instance, if you have a data frame and wish to delete specific columns). That's what you've done. I am unable to determine what it is that you want, though. You could try
sampleList[[2]] <- NA
instead of NULL, but if by "I want to lose it" you mean delete it, then you've already succeeded. That's why "the list elements get shifted up."
obj = list(x = "Some Value")
obj = c(obj,list(y=NULL)) #ADDING NEW VALUE
obj['x'] = list(NULL) #SETTING EXISTING VALUE
obj
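So, applied to the original example, keeping the slot while emptying it looks like this:

```r
sampleList <- as.list(letters[1:10])
sampleList[2] <- list(NULL)   # position 2 is kept, but now holds NULL
length(sampleList)            # still 10
is.null(sampleList[[2]])      # TRUE
sampleList[[3]]               # "c" -- nothing shifted
```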
If you need to create a list of NULL values which you can later populate with values (data frames, for example), this raises no complaint:
B <-vector("list", 2)
a <- iris[sample(nrow(iris), 10), ]
b <- iris[sample(nrow(iris), 10), ]
B[[1]]<-a
B[[2]]<-b
The above answers are similar, but I thought this was worth posting. It took me a while to figure this one out for a list of lists. My solution was:
mylist[[i]][j] <- list(double())
