Square brackets and dataframes in R

Okay I am rather puzzled by the different behaviours of dataframes and xtses in R and I'm hoping someone can explain it to me.
df = as.data.frame(x = c(1,2),row.names = c("2012-12-12","2012-12-13"))
xts = as.xts(x=c(1,2),order.by = as.POSIXct(c("2012-12-12","2012-12-13")))
I have two different datasets here. When you print them, they look almost identical. When I ask for the first row of the xts, xts[1,] returns the row with its column name and index. But df[1,] returns only a vector.
Is there a way to return the first row of the dataframe, complete with its row name and column name? I'm aware that I can hack it by doing as.data.frame(as.xts(df)[1,]) but is there a more elegant solution?

This is a special case: the data frame has only ONE column, so subsetting by rows leaves a single cell, and by default R drops that result to a plain vector.
In that case, you need to specify drop = FALSE, here
df[1, , drop = FALSE]
I'd also recommend that when you create a data frame from scratch, you use the data.frame() function instead of as.data.frame().
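To make the difference concrete, here is a minimal sketch. It assumes a one-column data frame built with data.frame() as recommended above; the column name x is my own choice, not something from the question.

```r
# Hypothetical one-column data frame (column name "x" is an assumption)
df <- data.frame(x = c(1, 2), row.names = c("2012-12-12", "2012-12-13"))

# Default behaviour: a single selected column is dropped to a plain vector
v <- df[1, ]

# drop = FALSE keeps the data frame structure, row name and column name included
row1 <- df[1, , drop = FALSE]

class(v)     # plain vector class
class(row1)  # still a data frame
```

With more than one column, df[1, ] already returns a data frame, so drop = FALSE only matters in the single-column case.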

Related

R how to create a dataframe by adding columns

I am very new to R. I have been using Python and MATLAB my whole life.
So here is what I would like to do. During each loop iteration, I compute a column that I would like to add to a dataframe.
The problem is that I do not know the length of the column, so I cannot create the dataframe with a specific number of rows. I keep getting an error when I try to add the column to the original empty dataframe...
# extract the data where the column 7 has no data.
df_glm <- data.frame(matrix(ncol = 11, nrow = 0))
for (j in 1:ncol(data_cancer)){
  col_ele <- data_cancer[,j]
  col_filtered <- col_ele[col_bool7]
  # make new dataframe by concatenating the filtered column.
  df_glm[,j] <- col_filtered
}
data_cancer_filter <- data_cancer[,col_bool7]
How can I resolve this issue?
I am getting an error at df_glm[,j] because the filtered column is longer than the zero-row dataframe. But I want to learn how to do this without creating a dataframe of the exact size beforehand.
If I am understanding this correctly, you're looping through columns, taking the rows where col_bool7 is TRUE, and putting them in another dataframe. dplyr filter() would be an efficient solution:
library(dplyr)
df_glm = data_cancer %>%
  filter(col_bool7)
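If you would rather stay in base R, the same idea can be sketched as below. The small data_cancer and the way col_bool7 is built are stand-ins I made up for illustration; I am assuming col_bool7 is a logical vector with one entry per row.

```r
# Hypothetical stand-ins for the asker's objects
data_cancer <- data.frame(a = 1:5,
                          b = c("p", "q", "r", "s", "t"),
                          g = c(NA, 2, NA, 4, 5))
col_bool7 <- is.na(data_cancer$g)  # assumed: one logical value per row

# Logical row indexing keeps every column and needs no preallocation
df_glm <- data_cancer[col_bool7, , drop = FALSE]

# If a loop is really needed, collect the columns in a list and
# convert to a data frame once at the end
cols <- vector("list", ncol(data_cancer))
names(cols) <- names(data_cancer)
for (j in seq_along(cols)) {
  cols[[j]] <- data_cancer[[j]][col_bool7]
}
df_loop <- as.data.frame(cols)
```

Growing a list and converting once avoids the length mismatch, because a list does not care how long each element is until you turn it into a data frame.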

R - Sum every two columns in a dataframe and paste results to new columns at the end

I have a dataframe of dynamic width, i.e. it gets wider every time new variables are attached. In this case, I need to sum the values in every two columns of 8:length(df) and attach the results (the sum of every two columns) at the end of this dataframe. So what I want to automate for all columns in question is this:
df <- df %>%
mutate(sumAB = A + B)
Ideally, I would like to name these new columns based on a vector containing the intended colnames, which I already prepared. As I am fairly new to R, I could not get this running with for loops or the apply family. Every suggestion appreciated.
Thanks!
You can use split.default to split the data frame into groups of two columns, then use sapply with rowSums to sum the values within each group.
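cols <- 8:ncol(df)
# group index 1, 1, 2, 2, ... so that every two columns share a group
groups <- rep(seq_along(cols), each = 2, length.out = length(cols))
# append the pairwise sums to the end of the full data frame
result <- cbind(df, sapply(split.default(df[cols], groups),
                           rowSums, na.rm = TRUE))
result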
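Here is a small worked sketch of the same approach that also applies the prepared column names. The toy df and the new_names vector are made up for illustration, and I am assuming the sums should be appended to the full data frame, as the question describes.

```r
# Hypothetical data: 7 leading columns, then value columns in positions 8 to 11
df <- data.frame(matrix(1:22, nrow = 2, ncol = 11))
new_names <- c("sumAB", "sumCD")  # assumed vector of prepared column names

cols <- 8:ncol(df)
# group index 1, 1, 2, 2: every two columns share a group
groups <- rep(seq_along(cols), each = 2, length.out = length(cols))

# one column of row sums per group
sums <- sapply(split.default(df[cols], groups), rowSums, na.rm = TRUE)
colnames(sums) <- new_names

# append the named sums to the end of the original data frame
result <- cbind(df, sums)
```

If the number of value columns is odd, the last group contains a single column, so its "sum" is just that column. Also note that with a one-row data frame sapply returns a vector rather than a matrix, so the colnames step would need adjusting.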

How can I get the column/variable names of a dataframe that fit certain parameters?

I came across a problem in my DataCamp exercise that basically asked "Remove the column names in this vector that are not factors." I know what they -wanted- me to do, and that was to simply do glimpse(df) and manually delete elements of the vector containing the column names, but that wasn't satisfying for me. I figured there was a simple way to store the column names of the dataframe that are factors into a vector. So, I tried two things that ended up working, but I worry they might be inefficient.
Example data frame:
factorVar <- as.factor(LETTERS[1:10])
df1 <- data.frame(x = 1, y = 1:10, factorVar = sample(factorVar, 10))
My first solution was this:
vector1 <- names(select_if(df1, is.factor))
This worked, but select_if returns an entire tibble of a filtered dataframe and then gets the column names. Surely there's an easier way...
Next, I tried this:
vector2 <- colnames(df1)[sapply(df1,is.factor)]
This also worked, but I wanted to know if there's a quicker, more efficient way of filtering column names based on their type and then storing the results as a vector.
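For what it's worth, two base R variations on the second attempt are sketched below. They are not benchmarked here, so treat the efficiency question as open; vapply is mainly a type-safe version of sapply, and Filter is a compact way to keep only the columns that pass a predicate.

```r
factorVar <- as.factor(LETTERS[1:10])
df1 <- data.frame(x = 1, y = 1:10, factorVar = sample(factorVar, 10))

# vapply guarantees a logical vector, even for a zero-column data frame
is_fac <- vapply(df1, is.factor, logical(1))
factor_cols <- names(df1)[is_fac]

# Filter keeps the columns for which the predicate is TRUE
factor_cols2 <- names(Filter(is.factor, df1))
```

Both variants only inspect each column's class, so the cost is per column rather than per row.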

R equivalent to SAS's "In" data set option for including and excluding overlapping data

I'm usually a SAS user but was wondering if there was a similar way in R to list data that can only be found in one data frame after merging them. In SAS I would have used
data want;
merge have1 (In=in1) have2 (IN=in2) ;
if not in2;
run;
to find the entries only in have1.
My R code is:
inner <- merge(have1, have2, by= "Date", all.x = TRUE, sort = TRUE)
I've tried setdiff() and anti_join() but neither seems to give me what I want. Additionally, I would like to find a way to do the converse of this. I would like to find the entries in have1 and have2 that have the same "Date" entry and then keep the remaining variables in the 2 data frames. For example, consider have1 with columns "Date", "ShotHeight", "ShotDistance" and have2 with columns "Date", "ThrowHeight", "ThrowDistance" so that the new dataframe, call it "new", has columns "Date", "ShotHeight", "ShotDistance", "ThrowHeight", "ThrowDistance".
Assuming only one by-variable, the simplest solution is not to merge at all:
want <- subset(have1, !(Date %in% have2$Date))
This subsets have1 to exclude rows where the value of Date is in have2.
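A short sketch of both directions, using "Date" as the key from the question. The two toy data frames are invented for illustration; only the column names come from the question.

```r
# Hypothetical example data with the column names from the question
have1 <- data.frame(Date = as.Date(c("2020-01-01", "2020-01-02", "2020-01-03")),
                    ShotHeight = c(1.1, 1.2, 1.3),
                    ShotDistance = c(10, 11, 12))
have2 <- data.frame(Date = as.Date(c("2020-01-02", "2020-01-03", "2020-01-04")),
                    ThrowHeight = c(2.1, 2.2, 2.3),
                    ThrowDistance = c(20, 21, 22))

# Rows of have1 whose Date does not appear in have2 (like "if not in2" in SAS)
only1 <- subset(have1, !(Date %in% have2$Date))

# The converse: keep only Dates present in both, with all columns from each.
# merge() without all.x/all.y performs an inner join.
new <- merge(have1, have2, by = "Date")
```

If you prefer dplyr, anti_join(have1, have2, by = "Date") should give the same rows as only1, and inner_join() the same rows as new.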

How should I apply the same formatting to a list of dataframes in R?

Here is what I've done so far. So, that's basically grabbing some tables off the internet using XML, putting them into a list of dataframes and then some mess trying (and failing) to format them in an efficient and consistent way.
I can't work out how to apply the same changes to all of the dataframes. I think I need to use llply, but I can't get it right. Overall I am trying to achieve:
Column names all legitimate R names using make.names, then use str_replace_all towards the end of the file to strip all non-alpha characters so the names are the same
Next I want to remove all but the first four columns from all of the dataframes
Then I want to add a column with the title for each book. I guess I'll have to do this manually.
Finally, I want to do an rbind to join all of the dataframes together
What's really got me stumped is how to apply the same transformations to each dataframe in the list such as modifying their column names and cutting off rows. Is llply the right tool for the job? How do I use it?
So far the most I've been able to achieve is turning my list of dataframes into a list of vectors with the right names. I believe this is because when I tried using names() it returned the vector of correct names, rather than a dataframe with the correct names. This was my attempt:
tlist <- llply(tabs, function(x) as.data.frame(str_replace_all(make.names(names(x)), "[^[:alpha:]]", "")))
I don't think I'm a million miles away here, but I can't think how to get it to return the full df.
Use this instead:
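library(plyr)     # llply, ldply, rbind.fill
library(stringr)  # str_replace_all
f <- function(x) {
  y <- x[, 1:4]   # keep only the first four columns
  # make the names syntactically valid, then strip all non-alpha characters
  names(y) <- str_replace_all(make.names(names(y)), "[^[:alpha:]]", "")
  y
}
result <- rbind.fill(llply(tabs, f))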
EDIT: following @baptiste, this may be better:
result <- ldply(tabs, f)
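The remaining step from the question, adding a title column per book, could be sketched in base R as follows. The tabs list, its column names and the titles vector are all made up for illustration, and gsub stands in for str_replace_all so the example has no package dependencies.

```r
# Hypothetical list of scraped tables with messy column names
tabs <- list(
  data.frame("Author Name" = c("a", "b"), "Year." = c(1990, 1991),
             "Pages#" = c(100, 200), "Publisher" = c("x", "y"),
             "Notes" = c("n1", "n2"), check.names = FALSE),
  data.frame("Author Name" = c("c", "d"), "Year." = c(1992, 1993),
             "Pages#" = c(300, 400), "Publisher" = c("z", "w"),
             check.names = FALSE)
)
titles <- c("Book A", "Book B")  # hypothetical titles, entered manually

clean <- function(x, title) {
  y <- x[, 1:4]                                          # first four columns
  names(y) <- gsub("[^[:alpha:]]", "", make.names(names(y)))  # alpha-only names
  y$title <- title                                       # same title on every row
  y
}

# Map pairs each table with its title; rbind stacks the cleaned tables
result <- do.call(rbind, Map(clean, tabs, titles))
```

One caveat: stripping all non-alpha characters can produce duplicate names (for example "c3" and "c4" both become "c"), so it is worth checking names(result) afterwards.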
