How to subset a single column fataframe and get a dataframe

How to subset a single column fataframe and get a dataframe - r

I have a DataFrame with only one column and rownames
> head(UMIpCells_df, n=10)
UMIs
MB04_GATAACTGGCCT 4571.266
MB04_ACCCTGTCATTT 4534.992
MB04_GTAAGACGAATG 4793.417
MB04_AGGCTATTCCAA 4786.393
MB04_ATTATCTGATTT 4478.233
MB04_CCCGGGTCTGCC 4765.347
MB04_AAACGAGCTGAC 4571.253
MB04_TGTTGCTTTTCG 4167.119
MB04_ACGTCCCCCAAA 4778.961
MB04_GTCGCGCAGTTC 4664.638
I want to subset the firs 5 rows but I got a numeric vector:
> UMIpCells_df[1:5,]
[1] 4571.266 4534.992 4793.417 4786.393 4478.233
However if I add an extra column to the UMIpCell_df the subset returns a df.
I found out that to return a df from a single column dataframe I have to add:
drop = False
> UMIpCells_df[(1:5), ,drop=FALSE]
UMIs
MB04_GATAACTGGCCT 4571.266
MB04_ACCCTGTCATTT 4534.992
MB04_GTAAGACGAATG 4793.417
MB04_AGGCTATTCCAA 4786.393
MB04_ATTATCTGATTT 4478.233
However I found this odd and as basic as it is I will like to learn why subsetting the simplest df (only 1 column) has to be different that subsetting any other DataFrame (>1 column). Hope you do not get offended by the elementary of this question.

Consider using tibbles and data_frame instead of the standard data.frame. While not base R, packages such as dplyr help to "correct" some of these behaviors you noticed that may no longer be beneficial.
Check out the vignette on tibbles here:
https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html
And here is a brief comparison of tibbles to data frames as well as some comparisons when subsetting:
http://r4ds.had.co.nz/introduction-2.html#tibbles

head(UMIpCells_df, n=5) is also a data frame, so you can just do:
new.df <- head(UMIpCells_df, n=5)

Related

apply resulting in list instead of dataframe

I have a dataframe with cases that repeat on the rows. Some rows have more complete data than others. I would like to group cases and then assign the first non-missing value to all NA cells in that column for that group. This seems like a simple enough task but I'm stuck. I have working syntax but when I try to use apply to apply the code to all columns in the dataframe I get a list back instead of a dataframe. Using do.call(rbind) or rbindlist or unlist doesn't quite fix things either.
Here's the syntax.
df$groupid<-group_indices (df,id1,id2) #creates group id on the basis of a combination of two variables
df%<>%group_by(id1,id2) #actually groups the dataframe according to these variables
df<-summarise(df, xvar1=xvar1[which(!is.na(xvar1))[1]]) #this code works great to assign the first non missing value to all missing values but it only works on 1 column at a time (X1).
I have many columns so I try using apply to make this a manageable task..
df<-apply(df, MARGIN=2, FUN=function(x) {summarise(df, x=x[which(!is.na(x))[1]])
}
)
This gets me a list for each variable, I wanted a dataframe (which I would then de-duplicate). I tried rbindlist and do.call(rbind) and these result in a long dataframe with only 3 columns - the two group_by variables and 'x'.
I know the problem is simply how I'm using apply, probably the indexing with 'which', but I'm stumped.

What about using lapply with do.call and cbind, like the following:
df <- do.call(cbind, lapply(df, function(x) {summarise(df, x=x[which(!is.na(x))[1]])}))

Find data by multiple variable names in R

I have a question regarding variables names in R.
In my dataset I have a list of 70 variable names as characters and I want to find the corresponding data (including the header) in the data.
For example I used the dataset iris. I don't want to select all variables by iris$Sepal.Length since I have 70 variables in the dataset that I use. In my code I can print the data but I am struggling with saving the data as a dataframe with the corresponding header names. Somebody any thoughts?
iris
head(iris)
colnames(iris)
b <- list("Sepal.Length","Petal.Length")
i=1
for (i in 1:length(b)){
#print(b[[i]])
print(iris[,c(b[[i]])])
c[,i]<-(iris[,c(b[[i]])])
}

It sounds like you're trying to get a subset of 70 columns from a data.frame or matrix. The 70 columns you have are stored in a list. R will let you get columns named by a character vector, but not by a list. So, you can just use unlist.
b <- list("Sepal.Length","Petal.Length")
newTable <- iris[,unlist(b)]

I find dplyr the best for this. If you turn iris into a tibble
iris <- as_tibble(iris)
You can then use the dplyr::select function either selecting by name (no quotes) or by position. You can even use the 1:5 notation selecting columns 1 to 5. A great place to start is: http://r4ds.had.co.nz

Are you looking for this ?
b <- c("Sepal.Length","Petal.Length")
New_iris=iris[,b]

R - creating dataframe from colMeans function

I've been trying to create a dataframe from my original dataframe, where rows in the new dataframe would represent mean of every 20 rows of the old dataframe. I discovered a function called colMeans, which does the job pretty well, the only problem, which still persists is how to change that vector of results back to dataframe, which can be further analysed.
my code for colMeans: (matrix1 in my original dataframe converted to matrix, this was the only way I managed to get it to work)
a<-colMeans(matrix(matrix1, nrow=20));
But here I get the numeric sequence, which has all the results concatenated in one single column(if I try for example as.data.frame(a)). How am I supposed to get this result back into dataframe where each column includes only the results for specific column name and not all the averages.
I hope my question is clear, thanks for help.

Based on the methods('as.data.frame'), as.data.frame.list is an option to convert each element of a vector to columns of a data.frame
as.data.frame.list(a)
data
m1 <- matrix(1:20, ncol=4, dimnames=list(NULL, paste0('V', 1:4)))
a <- colMeans(m1)

Applying a function to a dataframe to trim empty columns within a list environment R

I am a naive user of R and am attempting to come to terms with the 'apply' series of functions which I now need to use due to the complexity of the data sets.
I have large, ragged, data frame that I wish to reshape before conducting a sequence of regression analyses. It is further complicated by having interlaced rows of descriptive data(characters).
My approach to date has been to use a factor to split the data frame into sets with equal row lengths (i.e. a list), then attempt to remove the trailing empty columns, make two new, matching lists, one of data and one of chars and then use reshape to produce a common column number, then recombine the sets in each list. e.g. a simplified example:
myDF <- as.data.frame(rbind(c("v1",as.character(1:10)),
c("v1",letters[1:10]),
c("v2",c(as.character(1:6),rep("",4))),
c("v2",c(letters[1:6], rep("",4)))))
myDF[,1] <- as.factor(myDF[,1])
myList <- split(myDF, myDF[,1])
myList[[1]]
I can remove the empty columns for an individual set and can split the data frame into two sets from the interlacing rows but have been stumped with the syntax in writing a function to apply the following function to the list - though 'lapply' with 'seq_along' should do it?
Thus for the individual set:
DF <- myList[[2]]
DF <- DF[,!sapply(DF, function(x) all(x==""))]
DF
(from an earlier answer to a similar, but simpler example on this site). I have a large data set and would like an elegant solution (I could use a loop but that would not use the capabilities of R effectively). Once I have done that I ought to be able to use the same rationale to reshape the frames and then recombine them.
regards
jac

Try
lapply(split(myDF, myDF$V1), function(x) x[!colSums(x=='')])

renaming subset of columns in r with paste0

I have a data frame (my_df) with columns named after individual county numbers. I melted/cast the data from a much larger set to get to this point. The first column name is year and it is a list of years from 1970-2011. The next 3010 columns are counties. However, I'd like to rename the county columns to be "column_"+county number.
This code executes in R but for whatever reason doesn't update the column names. they remain solely the numbers... any help?
new_col_names = paste0("county_",colnames(my_df[,2:ncol(my_df)]))
colnames(my_df[,2:ncol(my_df)]) = new_col_names

The problem is the subsetting within the colnames call.
Try names(my_df) <- c(names(my_df)[1], new_col_names) instead.
Note: names and colnames are interchangeable for data.frame objects.
EDIT: alternate approach suggested by flodel, subsetting outside the function call:
names(my_df)[-1] <- new_col_names

colnames() is for a matrix (or matrix-like object), try simply names() for a data.frame
Example:
new_col_names=paste0("county_",colnames(my_df[,2:ncol(my_df)]))
my_df <- data.frame(a=c(1,2,3,4,5), b=rnorm(5), c=rnorm(5), d=rnorm(5))
names(my_df) <- c(names(my_df)[1], new_col_names)