Order and subset a multi-column dataframe in R?

I wanted to order by some column, and subset, a multi-column dataframe but the command used did not work
print(df[order(df$x) & df$x < 5,])
This does not order the results.
To debug this I generated a test dataframe with 1 column but this 'simplification' had unexpected effects
df <- data.frame(x = sample(1:50))
print(df[order(df$x) & df$x < 5,])
This does not order the results so I felt I had reproduced the problem but with simpler data.
Breaking down the process to first ordering and then subsetting led me to discover the ordering in this case does not generate a dataframe object
df <- data.frame(x = sample(1:50))
ndf <- df[order(df$x),]
print(class(ndf))
produces
[1] "integer"
Attempting to subset the resultant "integer" ndf object using dataframe syntax e.g.
print(ndf[ndf$x < 5, ])
obviously generates an error:
Error in ndf$x : $ operator is invalid for atomic vectors.
Simplifying even further, I found subsetting alone (not applying the order function ) does not generate a dataframe object
ndf <- df[df$x < 5,]
class(ndf)
[1] "integer"
It turns out for the multicolumn dataframe that separating the ordering and the subsetting does work as expected
df <- data.frame(x = sample(1:50), y = rnorm(50))
ndf <- df[order(df$x),]
print(ndf[ndf$x < 5, ])
and this solved my original problem, but led to two further questions:
Why is the type of object returned, as described above based on the 1 column dataframe test case, not a dataframe? ( I appreciate a 1 column dataframe just contains a single vector but it's still wrapped in a dataframe ?)
Is it possible to order and subset a multicolumn dataframe in 1 step?

A data.frame in R automatically simplifies to a vector when the result of a selection has just one column, which is always the case for a one-column data frame. This is a common and useful simplification and is better described in this question. You can prevent it with drop = FALSE.
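As a minimal sketch of that simplification (the data below is made up for illustration), compare the default behaviour with drop = FALSE on a one-column data frame:

```r
# One-column data frame, as in the question's test case
df <- data.frame(x = c(3L, 1L, 2L))

# Default: a single remaining column is simplified to a plain vector
v <- df[order(df$x), ]
class(v)

# drop = FALSE keeps the data.frame wrapper, so $ still works afterwards
d <- df[order(df$x), , drop = FALSE]
class(d)
d[d$x < 3, , drop = FALSE]
```

With two or more columns the simplification does not occur, which is why the multi-column case in the question behaved as expected.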
Subsetting and ordering are two different operations. You should do them in two logical steps (though possibly in one line of code). This line doesn't make a lot of sense:
df[order(df$x) & df$x < 5,]
Subsetting in R can be done either with a vector of row indices (which order() returns) or with logical values (which the < comparison returns). Combining the two with & does not do both: the indices from order() are all non-zero, so & coerces each of them to TRUE, the expression reduces to df$x < 5, and the ordering is lost. Instead, break it into two steps, for example with subset():
subset(df[order(df$x),], x < 5)
This does the ordering first and then the subsetting. Note that the condition no longer directly references the value of df specifically; it filters the rows of the re-ordered data.frame.
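To see why the original one-liner only filters, here is a small sketch with made-up data. It shows that the integer positions from order() all turn into TRUE under &, and then two base R ways of ordering and subsetting together:

```r
df <- data.frame(x = c(7L, 2L, 9L, 4L), y = c("a", "b", "c", "d"))

ord  <- order(df$x)      # integer positions: 2 4 1 3
keep <- df$x < 5         # logical mask: FALSE TRUE FALSE TRUE

# Non-zero integers become TRUE under &, so the ordering information is lost
identical(ord & keep, keep)

# Order first, then filter the already ordered rows
sorted <- df[ord, ]
sorted[sorted$x < 5, ]

# Or filter first and order the survivors; the result is the same
small <- df[keep, ]
small[order(small$x), ]
```

Filtering first means order() only has to sort the rows that survive, which can matter on large data.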
Operations like this are one of the reasons many people prefer the dplyr library for data manipulation. For example, this can be done with
library(dplyr)
dd <- data.frame(x = sample(1:50))
dd %>% filter(x<5) %>% arrange(x)

Related

How do I replace specific cell values in dataframe using continuous (sequential) indexing?

I have two dataframes of equal dimensions.
One has a particular value in some cells (e.g. 'abc') whose positions I need to index. The other has entirely different values. I need to replace the values in the other dataframe at the same indexes as 'abc'.
Examples:
df1 <- data.frame('1'=c('abc','bbb','rweq','dsaf','cxc','rwer','anc','ewr','yuje','gda'),
'2'=c(NA,NA,'bbb','dsaf','rwer','dsaf','ewr','cxc','dsaf','cxc'),
'3'=c(NA,NA,'dsaf','abc','bbb','cxc','yuje',NA,'ewr','anc'),
'4'=c(NA,NA,'cxc',NA,'abc','anc',NA,NA,'yuje','rweq'),
'5'=c(NA,NA,'anc',NA,'abc',NA,NA,NA,'rwer','rwer'),
'6'=c(NA,NA,'rweq',NA,'dsaf',NA,NA,NA,'bbb','bbb'),
'7'=c(NA,NA,'abc',NA,'ewr',NA,NA,NA,'abc','abc'),
'8'=c(NA,NA,'abc',NA,'rweq',NA,NA,NA,'cxc','bbb'),
'9'=c(NA,NA,NA,NA,'abc',NA,NA,NA,'anc',NA),
'10'=c(NA,NA,NA,NA,'abc',NA,NA,NA,'rweq',NA))
df2 <- data.frame('1'=c('green','black','white','yelp','help','green','red','brown','green','crack'),
'2'=c(NA,NA,'black','yelp','green','yelp','brown','help','yelp','help'),
'3'=c(NA,NA,'yelp','green','black','help','green',NA,'brown','red'),
'4'=c(NA,NA,'help',NA,'green','red',NA,NA,'green','white'),
'5'=c(NA,NA,'red',NA,'green',NA,NA,NA,'green','green'),
'6'=c(NA,NA,'white',NA,'yelp',NA,NA,NA,'black','black'),
'7'=c(NA,NA,'green',NA,'brown',NA,NA,NA,'green','green'),
'8'=c(NA,NA,'green',NA,'white',NA,NA,NA,'help','black'),
'9'=c(NA,NA,NA,NA,'green',NA,NA,NA,'red',NA),
'10'=c(NA,NA,NA,NA,'green',NA,NA,NA,'white',NA))
I can find the sequential indexes of 'abc', but which() returns them as a one-dimensional vector
which(df1 == 'abc')
#[1] 1 24 35 45 63 69 70 73 85 95
And I don't know how to replace values using this method.
In the output I expect to see df2 with the 'green' values replaced only at the same indexes as the 'abc' values in df1.
But note that 'green' values in df2 also occur at indexes other than those of 'abc' in df1.

I don't think your problem is appropriately approached with the data in a data.frame. That introduces several complications. First, each variable (column) in the data frame is a factor with different levels (this applies to R versions before 4.0.0, where stringsAsFactors defaulted to TRUE)! Second, your code is making a comparison between a list (data.frame) and a factor (which is coerced into an atomic vector). The help page for the == operator states "..if the other is a list R attempts to coerce it to the type of the atomic vector..". The help page also points out that factors get special handling in comparisons, where it first assumes you are comparing factor levels, which your code is doing.
I think you want to convert your data frames of identical dimensions to a matrix first. If you need the results in a data.frame, convert it back after as I show here but realize that the factor levels may have changed.
# Starting with the values assigned to df1 and df2
m1 <- as.matrix(df1)
m2 <- as.matrix(df2)
index <- which(m1 == "abc")
m2[index] <- "abc"
df2 <- as.data.frame(m2)
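The positions returned by which() on a matrix are column-major (they count down the first column, then the second, and so on), which is why the same vector can index a second matrix of identical dimensions. A small sketch with made-up 2 x 2 matrices, using 'green' as the replacement value purely for illustration:

```r
m1 <- matrix(c("abc", "bbb", NA, "abc"), nrow = 2)
m2 <- matrix(c("red", "black", NA, "blue"), nrow = 2)

# Linear (column-major) positions of the matches; NA comparisons are dropped
idx <- which(m1 == "abc")
idx

# The same matches as row/column pairs, if that form is easier to read
pos <- which(m1 == "abc", arr.ind = TRUE)
pos

# Replace the cells of m2 at the matching positions
m2[idx] <- "green"
m2
```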

Here is one way to do it. Learn about the *apply family in R: I think it is the most useful group of functions in this language, whatever you plan to do ;) Also note that a data.frame is of 'list' type.
df1[] <- lapply(df1, function(frame, pattern, replace){ # for each frame = column; df1[] keeps df1 a data.frame
  frame <- as.character(frame)          # work on characters so the new value is not rejected as an unknown factor level
  matches <- which(frame %in% pattern)  # positions in the column that match the pattern
  if(length(matches) > 0)               # if there is at least one match,
    frame[matches] <- replace           # give it the value you want
  return(frame)                         # return the modified column
}, pattern="abc", replace= "<whatYouWant>") # don't forget this part: the needed arguments!

Data.frame of Data.frames

I'm using a data.frame that contains many data.frames. I'm trying to access these sub-data.frames within a loop. Within these loops, the names of the sub-data.frames are contained in a string variable. Since this is a string, I can use the [,] notation to extract data from these sub-data.frames, e.g. X <- "sub.df" and then df[42,X] would output the same as df$sub.df[42].
I'm trying to create a single row data.frame to replace a row within the sub-data.frames. (I'm doing this repeatedly and that's why my sub-data.frame name is in a string). However, I'm having trouble inserting this new data into these sub-data.frames. Here is a MWE:
#Set up the data.frames and sub-data.frames
sub.frame <- data.frame(X=1:10,Y=11:20)
df <- data.frame(A=21:30)
df$Z <- sub.frame
Col.Var <- "Z"
#Create a row to insert
new.data.frame <- data.frame(X=40,Y=50)
#This works:
df$Z[3,] <- new.data.frame
#These don't (even though both sides of the assignment give the correct values/dimensions):
df[,Col.Var][6,] <- new.data.frame #Gives Warning and collapses df$Z to 1 dimension
df[7,Col.Var] <- new.data.frame #Gives Warning and only uses first value in both places
#This works, but is a work-around and feels very inelegant(!)
eval(parse(text=paste0("df$",Col.Var,"[8,] <- new.data.frame")))
Are there any better ways to do this kind of insertion? Given my experience with R, I feel like this should be easy, but I can't quite figure it out.
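One commonly suggested alternative, offered here as a sketch rather than an answer taken from the thread, is the [[ operator. It accepts a column name held in a string and, for this purpose, behaves like $:

```r
# Same setup as in the question
sub.frame <- data.frame(X = 1:10, Y = 11:20)
df <- data.frame(A = 21:30)
df$Z <- sub.frame
Col.Var <- "Z"
new.data.frame <- data.frame(X = 40, Y = 50)

# [[ extracts the nested data frame by name, the row assignment is
# applied to it, and the modified data frame is written back into df
df[[Col.Var]][8, ] <- new.data.frame
df$Z[8, ]
```

This avoids eval(parse()) because the string is only used as a name lookup, not as code.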

using adist on two columns of data frame

I want to use adist to calculate edit distance between the values of two columns in each row.
I am using it in more-or-less this way:
A <- c("mad","car")
B <- c("mug","cat")
my_df <- data.frame(A,B)
my_df$dist <- adist(my_df$A, my_df$B, ignore.case = TRUE)
my_df <- my_df[order(dist),]
The last two rows are the same as in my case, but the actual data frame looks a bit different - columns of my original data frame are character type, not factor. Also, the dist column seems to be returned as 2-column matrix, I have no idea why it happens.
Update:
I have read a bit and found that I need to apply it over the rows, so my new code is following:
apply(my_df, 1, function(d) adist(d[1], d[2]))
It works fine, but for my original dataset calling it by column numbers is impractical. How can I refer to column names in this function?

Using a tidyverse approach, you may use the following code:
library(tidyverse)
A <- c("mad","car")
B <- c("mug","cat")
my_df <- data.frame(A,B)
my_df %>%
  rowwise() %>%
  mutate(Lev_dist = adist(x = A, y = B, ignore.case = TRUE))

You can overcome that problem by using mapply, i.e.
mapply(adist, my_df$A, my_df$B)
#[1] 2 1

As per the adist function definition, the x and y arguments should be character vectors. In your example the function returns a 2x2 matrix because it also compares the cross pairs "mad" with "cat" and "car" with "mug".
The row-wise distances you want are on the main diagonal of that matrix.
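A short sketch of both points, using the data from the question (stringsAsFactors = FALSE is added here as an assumption so the columns are character regardless of R version). It reads the diagonal of the full matrix, and it shows that a row inside apply() can be indexed by column name:

```r
A <- c("mad", "car")
B <- c("mug", "cat")
my_df <- data.frame(A, B, stringsAsFactors = FALSE)

# adist() compares every element of A with every element of B
d <- adist(my_df$A, my_df$B)
dim(d)

# The row-wise distances sit on the main diagonal
diag(d)

# Inside apply(), each row is a named character vector,
# so columns can be picked by name instead of by position
res <- apply(my_df[c("A", "B")], 1, function(r) adist(r["A"], r["B"]))
res
```

Note that the full matrix grows with the square of the number of rows, so the diagonal approach is only sensible for small data.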

Looping a rep() function in r

df is a frequency table, where the values in a were reported as many times as recorded in column x,y,z. I'm trying to convert the frequency table to the original data, so I use the rep() function.
How do I loop the rep() function to give me the original data for x, y, z without having to repeat the function several times like I did below?
Also, can I input the result into a data frame, bearing in mind that the output will have different column lengths:
a <- (1:10)
x <- (6:15)
y <- (11:20)
z <- (16:25)
df <- data.frame(a,x,y,z)
df
rep(df[,1], df[,2])
rep(df[,1], df[,3])
rep(df[,1], df[,4])

If you don't want to repeat the rep() call for each column, you can use one of the apply functions. Note that you cannot store the result directly in a data.frame because the objects are of different lengths, but you can store it in a list and access the elements in a similar way to a data.frame. Something like this works:
df2<-sapply(df[,2:4],function(x) rep(df[,1],x))
What this sapply function is saying is for each column in df[,2:4], apply the rep(df[,1],x) function to it where x is one of your columns ( df[,2], df[,3], or df[,4]).
The below code just makes sure the apply function is giving the same result as your original way.
identical(df2$x,rep(df[,1], df[,2]))
[1] TRUE
identical(df2$y,rep(df[,1], df[,3]))
[1] TRUE
identical(df2$z,rep(df[,1], df[,4]))
[1] TRUE
EDIT:
If you want it as a data.frame object you can do this:
res<-as.data.frame(sapply(df2, '[', seq(max(sapply(df2, length)))))
Note this introduces NAs into your data.frame so be careful!
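A variant of the same idea using lapply(), sketched here because lapply() always returns a list, whereas sapply() returns a list only when the results differ in length. The names expanded, n_max and padded are illustrative:

```r
df <- data.frame(a = 1:10, x = 6:15, y = 11:20, z = 16:25)

# One expanded vector per count column
expanded <- lapply(df[c("x", "y", "z")], function(n) rep(df$a, times = n))
lengths(expanded)

# Optional: pad the shorter vectors with NA to build a data frame
n_max <- max(lengths(expanded))
padded <- as.data.frame(lapply(expanded, function(v) v[seq_len(n_max)]))
dim(padded)
```

Indexing a vector past its end yields NA, which is what pads the shorter columns.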

R - brevity when subsetting?

I'm still new to R and do all of my subsetting via the pattern:
data[ command that produces logical with same length as data ]
or
subset( data , command that produces logical with same length as data )
for example:
test = c("A", "B","C")
ignore = c("B")
result = test[ !( test %in% ignore ) ]
result = subset( test , !( test %in% ignore ) )
But I vaguely remember from my readings that there's a shorter/(more readable?) way to do this? Perhaps using the "with" function?
Can someone list alternative to the example above to help me understand the options in subsetting?

I don't know of a more succinct way of subsetting for your specific example, using only vectors. What you may be thinking of, regarding with, is subsetting data frames based on conditions using columns from that data frame. For example:
dat <- data.frame(variable1 = runif(10), variable2 = letters[1:10])
If we want to grab a subset of dat based on a condition using variable1 we could do this:
dat[dat$variable1 < 0,]
or we can save ourselves having to write dat$* each time by using with:
with(dat,dat[variable1 < 0,])
Now, you'll notice that I really didn't save any keystrokes by doing that in this case. But if you have a data frame with a long name, and a complicated condition it can save you a bit. See also the related ?within command if you're altering the data frame in question.
Alternatively, you can use subset which can do essentially the same thing:
subset(dat, variable1 < 0)
subset can also restrict which columns are returned, via the select argument.
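A brief sketch of select, assuming the same dat as above but with a seed and a threshold of 0.5 so that some rows actually match (runif() never returns negative values, so a condition of < 0 selects nothing):

```r
set.seed(1)
dat <- data.frame(variable1 = runif(10), variable2 = letters[1:10])

# Keep rows where variable1 is below 0.5 and return only variable2
res <- subset(dat, variable1 < 0.5, select = variable2)
res
```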

The with function would help if test were a column in a data frame (or object in a list), but with global vectors with does not help.
Some people have created a not in operator that could save a couple of key strokes from what you did. If all the values in test are unique then the setdiff function may be what you are thinking of (but if for example you had multiple "A"s then setdiff would only return 1 of them).
With your ignore being only 1 value you could use test != ignore, but that does not generalize to ignore having 2 or more values.
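A sketch of the two options mentioned above. The operator name %nin% is an arbitrary choice for this example, not a base R operator, and the data is extended with a duplicate to show the difference:

```r
test <- c("A", "B", "C", "A")
ignore <- c("B", "C")

# A user-defined "not in" operator built from %in%
`%nin%` <- Negate(`%in%`)
test[test %nin% ignore]

# setdiff() also drops the ignored values, but collapses duplicates
setdiff(test, ignore)
```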

I have seen timed comparisons of alternate methods and %in% (based on match) was one of the best performing strategies.
Alternates:
test[!test=="B"] #logical indexing
test[which(test != "B")] #numeric indexing
# the which() is not superfluous when there are NA's if you want them ignored
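The difference with missing values can be seen in a small sketch (the vector here is made up for illustration):

```r
test <- c("A", NA, "B")

# Logical indexing: an NA in the index produces an NA in the result
with_na <- test[test != "B"]
with_na

# which() returns only the positions that are TRUE, so the NA is dropped
without_na <- test[which(test != "B")]
without_na
```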

Another alternative to the original example:
test[test != ignore]
Other ways, using joran's example:
set.seed(1)
df <- data.frame(variable1 = runif(10), variable2 = letters[1:10])
Returning one column: df[[1]] or df[["variable1"]]. df$name is equivalent to df[["name", exact = FALSE]]
df[df[[1]] < 0.5, ]
df[df[["variable1"]] < 0.5, ]
Returning one data frame of one column: df[1]
df[df[1] < 0.5, ]
Using with
with(df, df[df[[1]] < 0.5, ]) # One column
with(df, df[df[["variable1"]] < 0.5, ]) # One column
with(df, df[df[1] < 0.5, ]) # data frame of one column
Using dplyr:
library(dplyr)
filter(df, variable1 < 0.5)
