I have to import an unorthodox file into R. I've attached a small example file with fake data to demonstrate the issue. The raw data that I need to wrangle is shown in the image "raw" and the tidied data that I'm looking to create are shown in the image "tidy".
[image "raw": the raw data layout]
[image "tidy": the desired tidy layout]
Each individual has (1) group-level information that applies to all individuals within the same group and (2) individual-level information that applies only to the respective person. In the attached file, the group-level data are Family and Location. Then there are repeating sets of columns, one set per individual, so the number of sets depends on how many people belong to the group.
For example, line 2 represents the Smith family that lives in Chicago. The Smith family has 3 members: John, Sally, and Ben. Each member has their own set of repeating columns with the same information types: Name, Age, Gender, Hobbies. Each of these sets of columns has identical names and is repeated for up to a maximum of 3 individuals per family (9 total columns).
What I need is to import this data into R and transform it into a tidy format, ideally using a tidyverse solution.
Thanks for your help!
The best strategy may also depend on how you read in your raw data (e.g., from Excel).
If you happen to have Excel data, you can use read_excel() from the readxl package (installed with the tidyverse) and include .name_repair = "minimal" to prevent changes to the column names.
In that case, repair_names() gives the repeated column names a consistent structure, with an underscore separator (this would give you Name, Name_1, Name_2, Age, Age_1, Age_2, etc.).
Finally, pivot_longer() on the repeated columns yields a tidy data frame.
There are also a number of alternative ways to fix your repeating column names and make them unique; for example, make.unique() called on names(df), or clean_names() from the janitor package (a sketch of the make.unique() route follows the code below).
library(tidyverse)
library(readxl)

df <- read_excel("raw_data.xlsx", .name_repair = "minimal")

df %>%
  repair_names(sep = "_") %>%
  pivot_longer(-c(Family, Location),
               names_to = c(".value", "variable"),
               names_sep = "_") %>%
  select(-variable)
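For completeness, a minimal sketch of the make.unique() alternative mentioned above (my own illustration, assuming the same df; make.unique() appends ".1", ".2", ... to duplicated names, so names_pattern is used here to split that suffix off):

df2 <- df
names(df2) <- make.unique(names(df2))  # Name, Age, ..., Name.1, Age.1, ...

df2 %>%
  pivot_longer(-c(Family, Location),
               names_to = c(".value", "variable"),
               # split e.g. "Name" / "Name.1" into the field name and a
               # (possibly empty) individual index:
               names_pattern = "([A-Za-z]+)\\.?([0-9]*)") %>%
  select(-variable)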
I have a data frame of bacteria families with all their OTU taxonomy levels (phylum, order, family, ...).
The data frame is large, and I would like the name of each column to be only the last part of each string: the part that starts with "f___".
For example, a column named something like p__Firmicutes;o__Lactobacillales;f___Lactobacillaceae should become just Lactobacillaceae.
I tried some methods in R (like dplyr::filter and filter(str_detect())) and also tried separating columns in Excel, but could not get what I wanted. I can't do it manually because there are too many columns.
With df being your data frame, you could use rename_with() from the dplyr package:
df %>%
  rename_with(
    ## your renaming function (see ?gsub for help on replacing
    ## with search patterns, i.e. regular expressions):
    ~ gsub('.*;f___(.*)$', '\\1', .x),
    ## column selection (see ?dplyr::select for handy shortcuts);
    ## note that the argument name is .cols, with a leading dot:
    .cols = everything()
  )
The .x in the replacement formula (~ ...) represents the argument passed to the replacement function, in this case the 'old' column name. You'll encounter this 'dot-something' pattern frequently in tidyverse packages.
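As a tiny illustration of that shorthand (my own example, not from the thread), these two calls do the same thing:

library(dplyr)

## formula shorthand: .x stands for the old column name
iris %>% rename_with(~ toupper(.x))
## equivalent explicit function
iris %>% rename_with(function(nm) toupper(nm))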
I solved it like this:

microbiota <- read_csv("Tablas/nivel5-familia_clean.csv")
colnames(microbiota) <- gsub(colnames(microbiota), pattern = '.*f__', replacement = "")
I am using R for a project for University. I imported a csv file and created a df. Everything was going smoothly until I had to gather the percentages of age groups in the "Age" column. There are 3,000 rows of information in my df. How do I only sample information from rows 50-200 to find the percentages of people ages 15-20, 21-25, 26-30, and 31-35?
You can try creating another df which only takes information from rows 50-200 using the slice() function, e.g. my_data %>% slice(50:200) would give rows 50-200. In case you didn't know, this function comes from dplyr, which is loaded with library(tidyverse). For filtering by particular age groups, you can again use the filter() function, e.g. my_data %>% filter(Age >= 15 & Age <= 20).
If your goal is actually to take a random sample, then rather than slicing specific rows you can use the function sample_n().
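Putting the two answers together, a minimal sketch (my own, assuming a data frame my_data with an integer Age column):

library(dplyr)

age_pcts <- my_data %>%
  slice(50:200) %>%                  # keep only rows 50-200
  filter(Age >= 15, Age <= 35) %>%   # restrict to the ages of interest
  mutate(age_group = cut(Age,
                         breaks = c(14, 20, 25, 30, 35),
                         labels = c("15-20", "21-25", "26-30", "31-35"))) %>%
  count(age_group) %>%
  mutate(percent = 100 * n / sum(n)) # percent within rows 50-200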
I had a general question regarding efficient coding as a beginner. I have a very wide dataset (374 obs.) on which I have to do several manipulations. I'll mainly be using mutate() and unite(). My question is:
The way I write code now, every time I do something new (i.e. if I combine 6 columns into one), I write a separate chunk of code for that step and create a new data frame.
Underneath, there'll be another chunk using mutate(), e.g. if I have to create a new variable by summing two columns.
Here's an example:
#1B. Combine location columns.
combinedlocations <- rawdata1 %>%
  unite(location, locations1, locations2, locations3,
        na.rm = TRUE, remove = TRUE)
combinedlocations <- combinedlocations[-c(6:7)]  # drop the unwanted columns

#2. Combine Sector together into one new column: Sector
#B. Combine columns, but override if Type.of.org = 'Independent Artist', where Sector = "Independent Artist"
Combinedsectors <- combinedlocations %>%
  unite(Sector, Sectors, na.rm = TRUE, remove = TRUE) %>%
I basically create a new dataframe for each manipulation, using the one I just created.
Is this correct? This is how I learned to do it in SAS. Or is it better to do it all in one data frame (maybe rawdata2), and is there a way to combine all these steps together using %>%? (I'm still trying to learn how piping works.)
This is on the edge of "opinion-based", but it's a good question. tl;dr it doesn't matter very much, it's mostly a matter of your preferred style.
Putting everything in one long pipe sequence (a %>% b %>% c %>% d) without intermediate assignments means you don't have as many intermediate objects cluttering your workspace; this means (1) you don't have to come up with names for them all (data1, data2, ...) and (2) you don't use up memory making lots of extra objects (though this isn't a problem unless you're working with Big Data).
On the other hand,
Putting everything in one long pipe sequence can make it harder to debug, because it's harder to inspect intermediate results if something goes wrong; this blog post lists a variety of packages/tools that are handy for debugging pipe sequences.
I tend to use piping sequences of about 5-6 lines. Your code in piped format would look something like this ...
#1B. Combine location columns.
Combinedsectors <- (rawdata1
  %>% unite(location, locations1, locations2, locations3,
            na.rm = TRUE, remove = TRUE)
  %>% select(-(6:7))
  #2. Combine Sector together into one new column: Sector
  #B. Combine columns, but override if Type.of.org = 'Independent Artist', where Sector = "Independent Artist"
  %>% unite(Sector, Sectors, na.rm = TRUE, remove = TRUE)
  %>% ...  # <whatever comes next>
)
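On the debugging point: one cheap trick (my own illustration, not from the original answer) is to temporarily break the pipe with an intermediate assignment, inspect it, and then continue piping from there:

library(tidyverse)

tmp <- rawdata1 %>%
  unite(location, locations1, locations2, locations3,
        na.rm = TRUE, remove = TRUE)
head(tmp)   # inspect the intermediate result, then pipe on from tmp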
Forgive me if this has been asked before. I am using the following code to create a list of groups, produced with LSD.test (agricolae) and nested by id.
lsd_groups <- dataset %>%
  group_by(id) %>%
  do(lsd_statistics = LSD.test(lm(value ~ book_name + treatment_name, data = .),
                               "treatment_name", alpha = 0.1)$groups) %>%
  unnest()
My problem is that when I unnest the results, I lose the identifiers (treatment names) associated with the means in the grouping.
I know if I were to leave the LSD.test output as a list, I could see the treatment names by running:
lsd_groups$lsd_statistics[[1]]
I could also convert the treatment names, which are stored as row.names, to a column.
I was hoping, though, for a more elegant solution using unnest(). Is there any way to instruct unnest() to keep those row names? Alternatively, is there a way to tell LSD.test to list the treatment names in a column instead of assigning them as row names? Thank you.
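For what it's worth, here is one hedged sketch of the row-names-to-column workaround mentioned above (my own suggestion, not a confirmed answer; it uses tibble::rownames_to_column() before unnesting):

library(tidyverse)
library(agricolae)

lsd_groups <- dataset %>%
  group_by(id) %>%
  do(lsd_statistics = LSD.test(lm(value ~ book_name + treatment_name, data = .),
                               "treatment_name", alpha = 0.1)$groups %>%
       tibble::rownames_to_column("treatment_name")) %>%
  unnest(lsd_statistics)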
I want to select 3117 columns out of a data frame.
I tried to select them by column names:
dataframe %>%
  select(
    'AAACCTGAGCACGCCT-1',
    'AAACCTGAGCGCTTAT-1',
    'AAACCTGAGCGTTGCC-1',
    ......,
    'TTGGAACCACGGACAA-1'
  )
or
firstpickupnames <- ('AAACCTGAGCACGCCT-1','AAACCTGAGCGCTTAT-1','AAACCTGAGCGTTGCC-1',......,'TTGGAACCACGGACAA-1')
Both ways, the R console just replied:
'AAACCTGAGCACGCCT-1','AAACCTGAGCGCTTAT-1','AAACCTGAGCGTTGCC-
1',......,'TTGGAACCACGGACAA-1'
+ )
+
What does this mean? Is there a limitation of columns that I can select in R?
Without a reproducible example, it's difficult to know exactly what you need, but dplyr::select() has several options for selecting columns, and dplyr::everything() might be what you're looking for:
library(dplyr)

# this reorders the column names, but keeps everything without having to name the columns specifically:
mtcars %>%
  select(carb, gear, everything())

# from a list of column names:
keep_columns <- c('cyl', 'disp', 'hp')
mtcars %>%
  select(one_of(keep_columns))

# specific names, and a range of names:
mtcars %>%
  select(hp, qsec:gear)

# You could also use contains(), starts_with(), ends_with(), or matches().
# Note that calling all of the following at once will give you no results:
mtcars %>%
  select(contains('t')) %>%
  select(starts_with('a')) %>%
  select(ends_with('b')) %>%
  select(matches('^m.+g$'))
The way that the console replies (with the + indicating that it is waiting for the rest of the expression) strongly suggests that you are running into a limit on how long a command the console can process (you are attempting to assemble the command by pasting from the clipboard), rather than an inherent limit on the number of columns that can be selected. The only place I could find this limitation documented is here, where it says "Command lines entered at the console are limited to about 4095 bytes."
In the comments you said that the column names you want to select are in a csv file. You didn't say much about the structure of that file, but suppose it contains a single list of column names. As an example, I created a file named "colnames.csv" which has a single line:
Sepal.Width, Petal.Length
Note that there is no need to manually place quote marks around the column names in the text file. Then in the R console I typed:
iris %>%
  select(one_of(as.character(read.csv("colnames.csv", header = FALSE,
                                      strip.white = TRUE,
                                      stringsAsFactors = FALSE))))
which worked as expected. Even though this example only used 2 columns, there is no reason that it should fail with 3000+, since the number of columns per se wasn't the problem with what you were doing.
If the structure of the csv file is different from the example then you would need to adjust the call to read.csv and perhaps the way that you convert it to a character vector, but you should be able to tweak this approach to your situation.
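For instance, if the file instead had one column name per line (an assumed layout, not the one above), a minimal tweak (mine, not from the original answer) could be:

keep <- readLines("colnames.csv")   # one column name per line, no quotes
dataframe %>%
  select(all_of(keep))              # all_of() needs a recent dplyr/tidyselect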