I want to select 3117 columns out of a data frame. I tried to select them by column names:
dataframe %>%
  select(
    'AAACCTGAGCACGCCT-1',
    'AAACCTGAGCGCTTAT-1',
    'AAACCTGAGCGTTGCC-1',
    ......,
    'TTGGAACCACGGACAA-1'
  )
or
firstpickupnames <- c('AAACCTGAGCACGCCT-1', 'AAACCTGAGCGCTTAT-1', 'AAACCTGAGCGTTGCC-1', ......, 'TTGGAACCACGGACAA-1')
Both ways, the R console just replied:
'AAACCTGAGCACGCCT-1','AAACCTGAGCGCTTAT-1','AAACCTGAGCGTTGCC-
1',......,'TTGGAACCACGGACAA-1'
+ )
+
What does this mean? Is there a limit on the number of columns I can select in R?
Without a reproducible example, it's difficult to know what exactly you're looking for, but dplyr::select() has several options for selecting columns, and dplyr::everything() might be what you're looking for:
library(dplyr)
# this reorders the columns, but keeps everything without having to name the columns specifically:
mtcars %>%
  select(carb, gear, everything())
# from a character vector of column names:
keep_columns <- c('cyl', 'disp', 'hp')
mtcars %>%
  select(one_of(keep_columns))
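(A side note, assuming a reasonably recent dplyr, 1.0 or later: all_of() and any_of() supersede one_of(); any_of() silently ignores names that are absent, while all_of() errors on them.)
mtcars %>%
  select(any_of(keep_columns))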
# specific names, and a range of names:
mtcars %>%
  select(hp, qsec:gear)
You could also use `contains()`, `starts_with()`, `ends_with()`, or `matches()`. Note that chaining all of the following at once will give you no results, because each successive select() narrows the previous result and no mtcars column satisfies all four conditions:
mtcars %>%
  select(contains('t')) %>%
  select(starts_with('a')) %>%
  select(ends_with('b')) %>%
  select(matches('^m.+g$'))
The way that the console replies (with the + indicating that it is waiting for the rest of the expression) strongly suggests that you are running up against a limit on how long a command the console can process (you are attempting to assemble the command by pasting from the clipboard), rather than an inherent limit on the number of columns that can be selected. The only place I could find this limitation documented is here, where it says "Command lines entered at the console are limited to about 4095 bytes."
In the comments you said that the column names you want to select are in a csv file. You didn't say much about the structure of that file, so suppose it contains a single list of column names. As an example, I created a file named "colnames.csv" which has a single line:
Sepal.Width, Petal.Length
Note that there is no need to manually place quote marks around the column names in the text file. Then in the R console I typed:
iris %>%
  select(one_of(as.character(read.csv("colnames.csv", header = FALSE,
                                      strip.white = TRUE, stringsAsFactors = FALSE))))
which worked as expected. Even though this example only used 2 columns, there is no reason that it should fail with 3000+, since the number of columns per se wasn't the problem with what you were doing.
If the structure of the csv file is different from the example then you would need to adjust the call to read.csv and perhaps the way that you convert it to a character vector, but you should be able to tweak this approach to your situation.
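For instance, if the csv file instead had one column name per line (a hypothetical layout, just to illustrate the adjustment), a sketch of the same approach would be:
wanted <- read.csv("colnames.csv", header = FALSE,
                   strip.white = TRUE, stringsAsFactors = FALSE)$V1
iris %>% select(one_of(wanted))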
Related
I have a database saved in Excel, and when I bring it into R there are many columns that should be numeric but get listed as character. I know that in read_excel I can specify each column's format using col_types = "numeric", but I have > 500 columns, so this gets a bit tedious.
Any suggestions on how to do this either when importing with read_excel, or after with dplyr or something similar?
I can do this one column at a time using a function that I wrote, but it still requires writing out each column name:
convert_column <- function(data, col_name) {
  new_col_name <- paste0(col_name)
  data %>% mutate(!!new_col_name := as.numeric(!!sym(col_name)))
}
data %>%
  convert_column("gFat_OVX") %>%
  convert_column("gLean_OVX") %>%
  convert_column("pFat_OVX") %>%
  convert_column("pLean_OVX")
I would ideally like to say "if a column header contains the text "Fat" or "Lean", then convert it to numeric", but I'm open to suggestions. Something like
select(df, contains("Fat" | "Lean"))
which, as written, isn't valid syntax. I'm also not sure how to make an example that allows people to test this out, given that we're starting with an Excel sheet here.
dplyr::mutate and across may be a solution after reading in the data.
Something like this, where df1 is your data frame from read_excel:
library(dplyr)
df1 <- df1 %>%
  mutate(across(contains(c("Fat", "Lean")), ~ as.numeric(.x)))
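Since the question notes that a reproducible example is hard to provide from an Excel sheet, here is a minimal made-up stand-in (the gFat_OVX/gLean_OVX names are taken from the question; the data values are invented) showing the conversion:
library(dplyr)
df1 <- tibble::tibble(
  id        = 1:3,
  gFat_OVX  = c("1.2", "3.4", "5.6"),  # numeric values stored as character
  gLean_OVX = c("10", "20", "30"),
  note      = c("a", "b", "c")         # genuinely character; left untouched
)
df1 %>% mutate(across(contains(c("Fat", "Lean")), as.numeric))
# gFat_OVX and gLean_OVX become <dbl>; id and note are unchanged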
I had a general question regarding the most efficient way to code as a beginner. I have a very wide dataset (374 obs) on which I have to do several manipulations; I'll mainly be using 'mutate' and 'unite'. My question is:
How I write the code now is that every time I do something new (i.e., if I combine 6 columns into one), I write a separate piece of code for it and create a new dataframe.
Underneath that there'll be another piece of code for 'mutate', e.g. if I have to create a new variable by summing two columns.
Here's an example:
#1B. Combine location columns.
combinedlocations <- rawdata1 %>%
  unite(location, locations1, locations2, locations3, na.rm = TRUE, remove = TRUE)
combinedlocations <- combinedlocations[-c(6:7)] # drop the unwanted columns
#2. Combine Sector together into one new column: Sector
#B. Combine columns, but override if Type.of.org = 'Independent Artist', where Sector = "Independent Artist"
Combinedsectors <- combinedlocations %>% unite(Sector, Sectors, na.rm = TRUE, remove = TRUE) %>%
I basically create a new dataframe for each manipulation, using the one I just created.
Is this correct? This is how I learned to do it in SAS. Or is it better to do it all in one dataframe (maybe rawdata2), and is there a way to combine all these steps using %>%? (I'm still trying to learn how piping works.)
This is on the edge of "opinion-based", but it's a good question. tl;dr it doesn't matter very much, it's mostly a matter of your preferred style.
putting everything in one long pipe sequence (a %>% b %>% c %>% d) without intermediate assignments means you don't have as many intermediate objects cluttering your workspace; this means (1) you don't have to come up with names for them all (data1, data2, ...) and (2) you don't use up memory making lots of extra objects (this isn't a problem unless you're working with Big Data)
On the other hand,
putting everything in one long pipe sequence can make it harder to debug, because it's harder to inspect intermediate results if something goes wrong; this blog post lists a variety of packages/tools that are handy for debugging pipe sequences.
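For example, one quick way to peek at an intermediate result (a generic sketch, not taken from that blog post) is to drop a printing block into the middle of the pipe:
library(dplyr)
mtcars %>%
  filter(cyl == 4) %>%
  { print(dim(.)); . } %>%  # print the intermediate dimensions, then pass the data along
  summarise(mean_mpg = mean(mpg))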
I tend to use piping sequences of about 5-6 lines. Your code in piped format would look something like this ...
#1B. Combine location columns.
Combinedsectors <- (rawdata1
  %>% unite(location, locations1, locations2, locations3,
            na.rm = TRUE, remove = TRUE)
  %>% select(-(6:7))
  #2. Combine Sector together into one new column: Sector
  #B. Combine columns, but override if Type.of.org = 'Independent Artist', where Sector = "Independent Artist"
  %>% unite(Sector, Sectors, na.rm = TRUE, remove = TRUE)
  %>% ... <whatever>
)
I have to import an unorthodox file into R. I've attached a small example file with fake data to demonstrate the issue. The raw data that I need to wrangle is shown in the image "raw" and the tidied data that I'm looking to create are shown in the image "tidy".
[image "raw": the original wide layout]
[image "tidy": the desired tidy layout]
Each individual has (1) group level information that applies to all individuals within the same group and (2) individual level information that only applies to the respective person. In the attached file, group level data includes Family and Location. Then, there are repeating sets of columns that pertain to each individual depending on how many people belong in the group.
For example, line 2 represents the Smith family that lives in Chicago. The Smith family has 3 members, including John, Sally, and Ben. Each member has their own set of repeating column names with the same information types: Name, Age, Gender, Hobbies. Each of these sets of columns have identical names and are repeated for up to a maximum of 3 individuals per family (9 total columns).
What I need is to import this data into R and transform it into a tidy format, ideally using a tidyverse solution.
Thanks for your help!
Perhaps the best strategy may depend also on how you input your raw data (e.g., from Excel).
If you happen to have Excel data, you can use read_excel from the readxl package (part of the tidyverse) and include .name_repair = "minimal" to prevent changes to the column names.
In this case, repair_names() gives the repeated column names a consistent structure with an underscore separator (this would give you Name, Name_1, Name_2, Age, Age_1, Age_2, etc.).
Finally, pivot_longer on the repeated columns produces a tidy data frame.
There are also a number of alternative ways to fix the repeating column names and make them unique; for example, make.unique() called on names(df), or clean_names(df) from the janitor package.
library(tidyverse)
library(readxl)
df <- read_excel("raw_data.xlsx", .name_repair = "minimal")
df %>%
  repair_names(sep = "_") %>%
  pivot_longer(-c(Family, Location),
               names_to = c(".value", "variable"),
               names_sep = "_") %>%
  select(-variable)
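And a one-line sketch of the base-R alternative mentioned above (assuming df still holds the raw duplicated names):
names(df) <- make.unique(names(df), sep = "_")  # Name, Name_1, Name_2, ...
# or: df <- janitor::clean_names(df)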
I have an HMMSCAN result file of protein domains with 10 columns. Please see the link for the CSV file.
https://docs.google.com/spreadsheets/d/10d_YQwD41uj0q5pKinIo7wElhDj3BqilwWxThfIg75s/edit?usp=sharing
But I want it to look like this:-
1BVN:P|PDBID|CHAIN|SEQUENCE Alpha-amylase Alpha-amylase_C A_amylase_inhib
3EF3:A|PDBID|CHAIN|SEQUENCE Cutinase
3IP8:A|PDBID|CHAIN|SEQUENCE Amdase
4Q1U:A|PDBID|CHAIN|SEQUENCE Arylesterase
4ROT:A|PDBID|CHAIN|SEQUENCE Esterase
5XJH:A|PDBID|CHAIN|SEQUENCE DLH
6QG9:A|PDBID|CHAIN|SEQUENCE Tannase
The repeated entries of column 3 should get grouped and its corresponding values of column 1, which are in different rows, should be arranged in separate columns.
This is what I wrote so far:
df <- read.csv("hydrolase_sorted.txt", header = FALSE, sep = "\t")
new <- df %>% select(V1, V3) %>% group_by(V3) %>% spread(V1, V3)
I hope I am clear with the problem statement. Thanks in advance!!
Your input data set has two irregular rows; however, the approach in your solution is right, and only one more step is required:
library(dplyr)
df %>% select(V3, V1) %>% group_by(V3) %>% mutate(x = paste(V1, collapse = " ")) %>% select(V3, x)
What we did here is simply concatenating the V1 strings within each V3 group. Before running the code above, you should preprocess and manually fix the improper rows (the TIM, Dannase, and DLH rows). To do that you can use the Text to Columns feature in Excel.
[Images: the required Text to Columns steps in Excel, with the problematic columns highlighted in yellow.]
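If you need each domain in its own column, as in the desired output above, rather than one space-separated string, a hedged sketch (assuming df as read in the question) would number the V1 values within each V3 group and pivot wide:
library(dplyr)
library(tidyr)
df %>%
  distinct(V3, V1) %>%
  group_by(V3) %>%
  mutate(pos = row_number()) %>%  # position of each domain within its group
  ungroup() %>%
  pivot_wider(names_from = pos, values_from = V1, names_prefix = "domain_")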
I need a dataframe mapping the names of some files matching a pattern to each line in those files. My problem is that I am unable to generate multiple output rows per file: the dataframe should grow in both columns and rows, gaining one row per line of each file. What I need is basically a left outer join, but I am struggling with the syntax.
library(dplyr)
app.lsts <- data.frame(
  file = list.files(path = '.', pattern = 'app.lst', recursive = TRUE)
) %>%
  mutate(command = paste0('cat ', file)) %>%
  mutate(packages = system(command, intern = TRUE))
The last mutate does not work because packages is a list of lines. How do I "unwrap" these?
First, some working (but not very good) code:
require(tidyverse)
fnames <- list.files(path = '.', pattern = '*.foo', recursive = TRUE)
out_df <-
  fnames %>%
  map(~ readLines(.x)) %>%  # one character vector of lines per file
  setNames(fnames) %>%
  t %>%
  as.data.frame %>%
  gather(fname, lines) %>%
  unnest()
out_df
This is a tidyverse-style command to generate the data that I think you want. Since I don't have your input files, I made up these sample files:
contents of f1.foo
line_1_f1
line_2_f1
contents of f2.foo
line_1_f2
line_2_f2
line_3_f2
Changes relative to your approach:
1. Avoid using the name of the built-in function file() as a column name. I used fname instead.
2. Don't use system() to read the files; there are built-in R functions for that. Using system() needlessly makes your code much less portable to other operating systems.
3. Build the data frame after all the data is read into R, not before. Because of the way non-standard evaluation in dplyr works, it's hard to use readLines(...) inside a mutate() where the file to be read varies.
4. Use purrr::map() to generate a list of per-file line vectors from a list of filenames. This is a tidyverse way of writing a for loop.
5. Set the names of the list elements with setNames().
6. Munge this list into a data.frame using t() and as.data.frame().
7. Tidy the data with gather() to collapse the data frame that has one column per file into a data frame with one file per row.
8. Expand the list-column using unnest().
I don't think this approach is very pretty, but it works. Another approach that avoids the ugly steps 5 and 6 is a for loop.
fnames <- list.files(path = '.', pattern = '*.foo', recursive = TRUE)
out_df <- data.frame(fname = c(), lines = c())
for (fname in fnames) {
  fcontents <- readLines(fname)  # readLines() already returns a character vector
  this_df <- data.frame(fname = fname, lines = fcontents)
  out_df <- bind_rows(out_df, this_df)
}
The output in either case is
fname lines
1 f1.foo line_1_f1
2 f1.foo line_2_f1
3 f2.foo line_1_f2
4 f2.foo line_2_f2
5 f2.foo line_3_f2
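For what it's worth, a more compact equivalent of the for loop (a sketch using purrr::map_dfr, assuming the same fnames as above) would be:
library(purrr)
out_df <- map_dfr(fnames, ~ data.frame(fname = .x, lines = readLines(.x)))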