Read in a table that meets specific requirements in R - r

I am reading in data from a .txt file that contains thousands of records:
table1 <- read.table("teamwork.txt", sep ="|", fill = TRUE)
Looks like:
f_name l_name hours_worked code
Jim Baker 8.5 T
Richard Copton 4.5 M
Tina Bar 10 S
However, I only want to read in data that has an 'S' or 'M' code.
I tried to concat the columns:
newdata <- subset(table1, code = 'S' |'M')
However I get this issue:
operations are possible only for numeric, logical or complex types

If there are thousands or tens of thousands of records (maybe not for millions), you should just be able to filter after you read in all the data:
> library(tidyverse)
> df %>% filter(code=="S"|code=="M")
# A tibble: 2 x 4
f_name l_name hours_worked code
<fct> <fct> <dbl> <fct>
1 Richard Copton 4.50 M
2 Tina Bar 10.0 S
If you really want to pull in only the rows that meet your condition, try the sqldf package, as in the example in the question "How do I read only lines that fulfil a condition from a csv into R?"

You can try
cols_g <- table1[which(table1$code == "S" | table1$code == "M"), ]
OR
cols_g <- subset(table1, code=="S" | code=="M")
OR
library(dplyr)
cols_g <- table1 %>% filter(code=="S" | code=="M")
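Each of these returns the filtered rows as a new data frame (cols_g). Because the result has fewer rows than table1, it cannot be stored as a column of table1; if you want to flag the matching rows in table1 instead, assign the logical condition itself, e.g. table1$cols_g <- table1$code == "S" | table1$code == "M".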
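As for why the original subset() call fails: code = 'S' | 'M' applies the logical OR to two character strings, which is what triggers the "operations are possible only for numeric, logical or complex types" error. Here is a minimal sketch using %in% instead, assuming a pipe-delimited file with a header row (the inline text is a made-up stand-in for teamwork.txt):

```r
# Hypothetical inline copy of the file so the sketch runs without teamwork.txt
raw <- "f_name|l_name|hours_worked|code
Jim|Baker|8.5|T
Richard|Copton|4.5|M
Tina|Bar|10|S"

# Assumption: the real file has a header row, hence header = TRUE
table1 <- read.table(text = raw, sep = "|", header = TRUE,
                     stringsAsFactors = FALSE)

# Compare the column against a vector of accepted codes
newdata <- subset(table1, code %in% c("S", "M"))
newdata
```

%in% also scales better than chaining == with | when the list of accepted codes grows.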

Related

R for loop to extract info from a file and add it into tibble?

I am not great with tidyverse so forgive me if this is a simple question. I have a bunch of files with data that I need to extract and add into distinct columns in a tibble I created.
I want the row names to start with the file IDs which I did manage to create:
filelist <- list.files(pattern=".txt") # Gives me the filenames in current directory.
# The filenames are something like AA1230.report.txt for example
file_ID <- trimws(filelist, whitespace="\\..*") # Gives me the ID which is before the "report.txt"
metadata <- as_tibble(file_ID[1:181]) # create dataframe with IDs as row names for 180 files.
Now in these report files are information on species and abundance (kraken report files for those familiar with kraken) and all I need is to extract the number of reads for each domain. I can easily search up in each file the domains and number of reads that fall into that domain using something like:
sample_data <- as_tibble(read.table("AA1230.report.txt", sep="\t", header=FALSE, strip.white=TRUE))
sample_data <- rename(sample_data, Percentage=V1, Num_reads_root=V2, Num_reads_taxon=V3, Rank=V4, NCBI_ID=V5, Name=V6) # Just renaming the column headers for clarity
sample_data %>% filter(Rank=="D") # D for domain
This gives me a clear output such as:
Percentage Num_Reads_Root Num_Reads_Taxon Rank NCBI_ID Name
<dbl> <int> <int> <fct> <int> <fct>
1 75.9 60533 28 D 2 Bacteria
2 0.48 386 0 D 2759 Eukaryota
3 0.01 4 0 D 2157 Archaea
4 0.02 19 0 D 10239 Viruses
Now, I want to just grab the info in the second column and final column and save this info into my tibble so that I can get something like:
> metadata
value Bacteria_Counts Eukaryota_Counts Viruses_Counts Archaea_Counts
<chr> <int> <int> <int> <int>
1 AA1230 60533 386 19 4
2 AB0566
3 AA1231
4 AB0567
5 BC1148
6 AW0001
7 AW0002
8 BB1121
9 BC0001
10 BC0002
....with 171 more rows
I'm just having trouble coming up with a for loop to create these sample_data outputs, then from that, extract the info and place into a tibble. I guess my first loop should create these sample_data outputs so something like:
for (file in filelist) {
# get_domains: read `file` and keep only its domain rows
}
Then another loop to extract that info from the above loop and insert it into my metadata tibble.
Any suggestions? Thank you so much!
PS: If regular dataframes in R is better for this let me know, I have just recently learned that tidyverse is a better way to organize dataframes in R but I have to learn more about it.
You could also do:
library(tidyverse)
filelist <- list.files(pattern=".txt")
nms <- c("Percentage", "Num_reads_root", "Num_reads_taxon", "Rank", "NCBI_ID", "Name")
set_names(filelist,filelist) %>%
map_dfr(read_table, col_names = nms, .id = 'file_ID') %>%
filter(Rank == 'D') %>%
select(file_ID, Name, Num_reads_root) %>%
pivot_wider(id_cols = file_ID, names_from = Name, values_from = Num_reads_root) %>%
mutate(file_ID = str_remove(file_ID, "\\..*$")) # keep only the ID before ".report.txt"
I've found that using a for loop is nice sometimes because it saves all the progress along the way in case you hit an error. Then you can find the problem file and debug it, or wrap the read in try() and throw a warning().
library(tidyverse)
filelist <- list.files(pattern=".txt") #list files
tmp_list <- list()
for (i in seq_along(filelist)) {
my_table <- read_tsv(filelist[i], # It looks like your files are all .tsv's without a header row
col_names = c("Percentage", "Num_reads_root", "Num_reads_taxon", "Rank", "NCBI_ID", "Name")) %>%
filter(Rank=="D") %>%
mutate(file_ID = trimws(filelist[i], whitespace="\\..*")) %>% # use `=`, not `<-`, inside mutate()
select(file_ID, everything())
tmp_list[[i]] <- my_table
}
out <- bind_rows(tmp_list)
out
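Since the PS asks about regular data frames, and the answer above mentions try(), here is a rough base R sketch of the same loop that skips unreadable files with a warning and then spreads the domain counts into one row per file. The file contents, the temporary directory and the helper name read_domains are all made up for illustration; it uses tryCatch() rather than try(), and xtabs() fills in 0 for domains a file does not report.

```r
# Made-up report files in a temporary directory (hypothetical data)
tmp <- tempfile("kraken_demo")
dir.create(tmp)
writeLines(c("75.9\t60533\t28\tD\t2\tBacteria",
             "0.48\t386\t0\tD\t2759\tEukaryota",
             "10.0\t8000\t12\tP\t1224\tProteobacteria"),
           file.path(tmp, "AA1230.report.txt"))
writeLines(c("60.0\t500\t3\tD\t2\tBacteria",
             "0.02\t19\t0\tD\t10239\tViruses"),
           file.path(tmp, "AB0566.report.txt"))
writeLines("not a kraken report", file.path(tmp, "BC1148.report.txt"))

nms <- c("Percentage", "Num_reads_root", "Num_reads_taxon",
         "Rank", "NCBI_ID", "Name")
filelist <- list.files(tmp, pattern = "\\.txt$", full.names = TRUE)

# Hypothetical helper: read one report and keep the domain rows
read_domains <- function(path) {
  x <- read.table(path, sep = "\t", header = FALSE, quote = "",
                  strip.white = TRUE, stringsAsFactors = FALSE)
  if (ncol(x) != length(nms)) stop("unexpected number of columns")
  names(x) <- nms
  x <- x[x$Rank == "D", c("Name", "Num_reads_root")]
  x$file_ID <- sub("\\..*$", "", basename(path))  # ID before ".report.txt"
  x
}

results <- list()
for (f in filelist) {
  # On error, warn and return NULL so the loop keeps going
  res <- tryCatch(read_domains(f), error = function(e) {
    warning("Skipping ", basename(f), ": ", conditionMessage(e), call. = FALSE)
    NULL
  })
  results[[basename(f)]] <- res  # assigning NULL adds nothing
}

long <- do.call(rbind, results)
# One row per file, one column per domain
wide <- as.data.frame.matrix(xtabs(Num_reads_root ~ file_ID + Name, data = long))
wide
```

The broken third file is reported in a warning and left out of the result, while the two readable files still make it into wide.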

Recode values based on look up table with dplyr (R)

A relatively trivial question that has been bothering me for a while, but to which I have not yet found an answer - perhaps because I have trouble verbalizing the problem for search engines.
Here is a column of a data frame that contains identifiers.
data <- data.frame("id" = c("D78", "L30", "F02", "A23", "B45", "T01", "Q38", "S30", "K84", "O04", "P12", "Z33"))
Based on a lookup table, outdated identifiers are to be recoded into new ones. Here is an example look up table.
recode_table <- data.frame("old" = c("A23", "B45", "K84", "Z33"),
"new" = c("A24", "B46", "K88", "Z33"))
What I need now can be done with a merge or a loop. Here a loop example:
for(ID in recode_table$old) {
data[data$id == ID, "id"] <- recode_table[recode_table$old == ID, "new"]
}
But I am looking for a dplyr solution without having to use the "join" family. I would like something like this:
data <- mutate(data, id = ifelse(id %in% recode_table$old, filter(recode_table, old == id) %>% pull(new), id))
Obviously though, I can't use the column name ("id") of the table in order to identify the new ID.
References to corresponding passages in documentations or manuals are also appreciated. Thanks in advance!
You can use recode with unquote splicing (!!!) on a named vector
library(dplyr)
# vector of new IDs
recode_vec <- recode_table$new
# named with old IDs
names(recode_vec) <- recode_table$old
data %>%
mutate(id = recode(id, !!!recode_vec))
# id
# 1 D78
# 2 L30
# 3 F02
# 4 A24
# 5 B46
# 6 T01
# 7 Q38
# 8 S30
# 9 K88
# 10 O04
# 11 P12
# 12 Z33
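For comparison, the same lookup can be sketched in base R with match(); this is not the recode() approach above, just an equivalent that avoids both joins and loops. It assumes id is a character column (hence stringsAsFactors = FALSE), and the expression on the right-hand side could also be used inside mutate().

```r
data <- data.frame(id = c("D78", "L30", "F02", "A23", "B45", "T01",
                          "Q38", "S30", "K84", "O04", "P12", "Z33"),
                   stringsAsFactors = FALSE)
recode_table <- data.frame(old = c("A23", "B45", "K84", "Z33"),
                           new = c("A24", "B46", "K88", "Z33"),
                           stringsAsFactors = FALSE)

# Position of each id in the lookup table, NA when the id is not listed
idx <- match(data$id, recode_table$old)

# Keep the original id where there is no match, otherwise take the new one
data$id <- ifelse(is.na(idx), data$id, recode_table$new[idx])
data$id
```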

Problem formatting spreadsheets in R, how can I read and write to tables using R?

I'm working with R for the first time for a class in college. To preface this: I don't know enough to know what I don't know, so I'm sorry if this question has been asked before. I am trying to predict the results of the Texas state house elections in 2020, and I think the best prior for that is the results of the 2018 state house elections. There are 150 races, so I can't bear to input them all by hand, but I can't find any spreadsheet that has data formatted how I want it. I want it in a pretty standard table format:
[Image: my desired table format]
However, the table I have from the Secretary of State looks like the following:
[Image: gross ugly table]
I wrote some pseudocode:
Here's the pseudocode; basically, we want to construct a new CSV:
'''%First, we want to find a district, the house races are always preceded by a line of dashes, so I will need a function like this:
Create a New CSV;
for(x=1; x<151 ; x +=1){
Assign x to the cell under the district number column;
Find "---------------" ;
Go down one line;
Go over two lines;
% We should now be in the third column and now want to read in which party got how many votes. The number of parties is not consistent, so we need to account for uncontested races, Libertarians, Greens, and write-ins. I want totals for Republicans, Democrats, and Other.
while(cell is not empty){
Party <- function which reads cell (but I want to read a string);
go right one column;
Votes <- function which reads cell (but I want to read an integer);
if(Party = Rep){
put this data in place in new CSV;
else if (Party = Dem)
put this data in place in new CSV;
else
OtherVote += Votes;
};
};
Assign OtherVote to the column for other party;
OtherVote <- 0;
%Now I want to assign 0 to null cells (ones where no rep, or no Dem, or no other party contested
read through single row 4 spaces, if its null assign it 0;
Party <- null
};'''
But I don't know enough to google what to do! Here's what I need help with: Can I create a new CSV in Rstudio, how? How can I read specific cells in a table, hopefully indexing? Lastly, how do I write to a table in R. Any help is appreciated! Thank you!
Can I create a new CSV in Rstudio, how?
Yes you can. Use the "write.csv" function.
write.csv(df, file = "df.csv") #see help for more information.
How can I read specific cells in a table?
Use brackets after df, as in the example below.
df <- data.frame(x = c(1,2,3), y = c("A","B","C"), z = c(15,25,35))
df[1,1]
#[1] 1
df[1,1:2]
# x y
#1 1 A
How do I write to a table in R?
If you want to write a table in xlsx use the function write.xlsx from openxlsx package.
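To tie the three answers together, here is a small sketch that writes to individual cells with the same bracket indexing, replaces empty (NA) cells with 0 as the pseudocode asks, and saves the result as a CSV. The data frame and the temporary file name are made up for illustration.

```r
df <- data.frame(x = c(1, 2, 3),
                 y = c("A", "B", "C"),
                 z = c(15, NA, 35),
                 stringsAsFactors = FALSE)

# Write to a single cell: row 1, column "x"
df[1, "x"] <- 10

# Replace every missing value in column z with 0
df$z[is.na(df$z)] <- 0

# Save as CSV (row.names = FALSE avoids an extra index column)
out_file <- tempfile(fileext = ".csv")
write.csv(df, out_file, row.names = FALSE)

# Read it back to check the round trip
back <- read.csv(out_file, stringsAsFactors = FALSE)
back
```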
Wikipedia seems to have a table that is closer to the format you are looking for.
In order to get to the table you are looking for we need a few steps:
Download data from Wikipedia and extract table.
Clean up table.
Select columns.
Calculate margins.
1. Download data from wikipedia and extract table.
The rvest package helps with downloading and parsing websites into R objects.
First we download the HTML of the whole website.
library(dplyr)
library(rvest)
wiki_html <-
read_html(
"https://en.wikipedia.org/wiki/2018_United_States_House_of_Representatives_elections_in_Texas"
)
There are a few ways to get a specific object from an HTML file. In this case
I decided to look for the table that has the class name "wikitable plainrowheaders sortable",
as I learned from inspecting the code that the only table with that class is
the one we want to extract.
library(purrr)
html_nodes(wiki_html, "table") %>%
map_lgl( ~ html_attr(., "class") == "wikitable plainrowheaders sortable") %>%
which()
#> [1] 20
Then we can select table number 20 and convert it to a dataframe with html_table()
raw_table <-
html_nodes(wiki_html, "table")[[20]] %>%
html_table(fill = TRUE)
2. Clean up table.
The table has duplicated names; we can change that by using as_tibble() and its .name_repair argument. We then use dplyr::select() to get the columns. Furthermore, we use dplyr::filter() to delete the first two rows, which have "District" as a value in the District column. Now the columns are still character
vectors, but we need them to be numeric, therefore we first delete commas from
all columns and then transform columns 2 to 4 to numeric.
clean_table <-
raw_table %>%
as_tibble(.name_repair = "unique") %>%
filter(District != "District") %>%
mutate_all( ~ gsub(",", "", .)) %>%
mutate_at(2:4, as.numeric)
3. Select columns and 4. Calculate margins.
We use dplyr::select() to select the columns you are interested in and give them more helpful names.
Finally, we calculate the margin between Democratic and Republican votes by first adding up their votes
as total_votes and then dividing the absolute difference by total_votes.
clean_table %>%
select(District,
RepVote = Republican...2,
DemVote = Democratic...4,
OthVote = Others...6) %>%
mutate(
total_votes = RepVote + DemVote,
margin = abs(RepVote - DemVote) / total_votes * 100
)
#> # A tibble: 37 x 6
#> District RepVote DemVote OthVote total_votes margin
#> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 District 1 168165 61263 3292 229428 46.6
#> 2 District 2 139188 119992 4212 259180 7.41
#> 3 District 3 169520 138234 4604 307754 10.2
#> 4 District 4 188667 57400 3178 246067 53.3
#> 5 District 5 130617 78666 224 209283 24.8
#> 6 District 6 135961 116350 3731 252311 7.77
#> 7 District 7 115642 127959 0 243601 5.06
#> 8 District 8 200619 67930 4621 268549 49.4
#> 9 District 9 0 136256 16745 136256 100
#> 10 District 10 157166 144034 6627 301200 4.36
#> # … with 27 more rows
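As a quick check of the margin formula, here is the calculation for District 2 on its own, using the vote counts from the output above (the variable names are just for illustration):

```r
rep_vote <- 139188
dem_vote <- 119992

# Two-party total, then the absolute gap as a percentage of that total
total_votes <- rep_vote + dem_vote
margin <- abs(rep_vote - dem_vote) / total_votes * 100

total_votes       # 259180
round(margin, 2)  # 7.41
```

Note that the margin is relative to the two-party total, so votes for other parties do not enter into it.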
Edit: In case you want to go with the data provided by the state, it looks to me as if the data you are looking for is in the first, third and fourth column. So what you want to do is.
(All the code below is not tested, as I do not have the original data.)
read data into R
library(readr)
tx18 <- read_csv("filename.csv")
select relevant columns
tx18 <- tx18 %>%
select(c(1,3,4))
clean table
tx18 <- tx18 %>%
filter(!is.na(X3),
X3 != "Party",
X3 != "Race Total")
Group and summarize data by party
tx18 <- tx18 %>%
group_by(X3) %>%
summarise(votes = sum(X4)) # sum the vote counts (4th column), not the party labels
Pivot/ Reshape data to wide format
tx18 %>%
pivot_wider(names_from = X3,
values_from = votes)
After this you could then calculate the margin similarly as I did with the Wikipedia data.
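The "Republican, Democrat, Other" bookkeeping from the pseudocode in the question can also be sketched without an explicit loop. The data frame below is entirely made up (its column names and party codes are assumptions, since the real state file is not shown); the idea is to lump minor parties into "Other" and let xtabs() fill uncontested slots with 0.

```r
# Hypothetical long-format results: one row per candidate
long <- data.frame(
  district = c(1, 1, 1, 2, 2, 3),
  party    = c("REP", "DEM", "LIB", "DEM", "GRN", "REP"),
  votes    = c(100, 80, 5, 120, 10, 90),
  stringsAsFactors = FALSE
)

# Everything that is not REP or DEM counts as "Other"
long$group <- ifelse(long$party %in% c("REP", "DEM"), long$party, "Other")
long$group <- factor(long$group, levels = c("REP", "DEM", "Other"))

# Sum votes per district and group; missing combinations become 0
totals <- as.data.frame.matrix(xtabs(votes ~ district + group, data = long))
totals$district <- as.integer(rownames(totals))
totals
```

Fixing the factor levels up front guarantees that all three columns exist even if, say, no third-party candidate ran anywhere.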

dplyr lookup table / pattern matching [duplicate]

I was looking for a smart, or "tidier" way, to make use of a lookup table in the tidyverse, but could not find a satisfying solution.
I have a dataset and lookup table:
# Sample data
data <- data.frame(patients = 1:5,
treatment = letters[1:5],
hospital = c("yyy", "yyy", "zzz", "www", "uuu"),
response = rnorm(5))
# Lookup table
lookup <- tibble(hospital = c("yyy", "uuu"), patients = c(1,5))
... where each row in the lookup table is the exact pattern for which I want to filter the first tibble (data).
The wanted result would look like this:
# A tibble: 2 x 4
patients treatment hospital response
<dbl> <chr> <chr> <dbl>
1 1.00 a yyy -0.275
2 5.00 e uuu -0.0967
The easiest solution I came up with is something like this:
as_tibble(data) %>%
filter(paste(hospital, patients) %in% paste(lookup$hospital, lookup$patients))
However, this must be something that a lot of people regularly do. Is there a cleaner and more convenient way to do this (e.g. for more than two columns in your lookup table)?
Since the default behavior of dplyr::inner_join() is to match on common columns between the two tibbles passed to the function and the lookup table consists of only the 2 key columns, the shortest code is as follows:
library(dplyr)
# Sample data
data <- tibble(patients = 1:5,
treatment = letters[1:5],
hospital = c("yyy", "yyy", "zzz", "www", "uuu"),
response = rnorm(5))
# Lookup table
lookup <- tibble(hospital = c("yyy", "uuu"), patients = c(1,5))
data %>% inner_join(.,lookup)
...and the output:
> data %>% inner_join(.,lookup)
Joining, by = c("patients", "hospital")
# A tibble: 2 x 4
patients treatment hospital response
<dbl> <chr> <chr> <dbl>
1 1.00 a yyy -1.44
2 5.00 e uuu -0.313
>
Because the desired output can be accomplished by a join on key columns across the tibbles, the paste() code in the OP is unnecessary.
Also note that inner_join() is the right type of join because the desired output is rows that match across both incoming tibbles, and the lookup table does not have duplicate rows. If the lookup table contained duplicate rows, then semi_join() would be the appropriate function, per the comments on the OP.
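To make the inner_join() versus semi_join() point concrete, here is a small sketch with a lookup table that deliberately contains a duplicated row. The response values and the name lookup_dup are made up; the by columns are spelled out to avoid the "Joining, by" message.

```r
library(dplyr)

data <- tibble(patients  = 1:5,
               treatment = letters[1:5],
               hospital  = c("yyy", "yyy", "zzz", "www", "uuu"),
               response  = c(0.1, 0.2, 0.3, 0.4, 0.5))

# Hypothetical lookup table with the ("yyy", 1) row listed twice
lookup_dup <- tibble(hospital = c("yyy", "uuu", "yyy"),
                     patients = c(1L, 5L, 1L))

by_cols <- c("patients", "hospital")

# inner_join() returns one output row per matching pair, so patient 1 appears twice
inner <- inner_join(data, lookup_dup, by = by_cols)

# semi_join() only filters data, so each matching row appears once
semi <- semi_join(data, lookup_dup, by = by_cols)

nrow(inner)
nrow(semi)
```

semi_join() also never adds columns from the lookup table, which makes it the safer choice when the lookup is used purely as a filter.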

Using R, Randomly Assigning Students Into Groups Of 4

I'm still learning R and have been given the task of grouping a long list of students into groups of four based on another variable. I have loaded the data into R as a data frame. How do I sample entire rows without replacement, one from each of 4 levels of a variable and have R output the data into a spreadsheet?
So far I have been tinkering with a for loop and the sample function but I'm quickly getting over my head. Any suggestions? Here is sample of what I'm attempting to do. Given:
Last.Name <- c("Picard","Troi","Riker","La Forge", "Yar", "Crusher", "Crusher", "Data")
First.Name <- c("Jean-Luc", "Deanna", "William", "Geordi", "Tasha", "Beverly", "Wesley", "Data")
Email <- c("a#a.com","b#b.com", "c#c.com", "d#d.com", "e#e.com", "f#f.com", "g#g.com", "h#h.com")
Section <- c(1,1,2,2,3,3,4,4)
df <- data.frame(Last.Name,First.Name,Email,Section)
I want to randomly select a Star Trek character from each section and end up with 2 groups of 4. I would want the entire row's worth of information to make it over to a new data frame containing all groups with their corresponding group number.
I'd use the wonderful package 'dplyr'
require(dplyr)
random_4 <- df %>% group_by(Section) %>% slice(sample(c(1,2),1))
random_4
Source: local data frame [4 x 4]
Groups: Section
Last.Name First.Name Email Section
1 Troi Deanna b#b.com 1
2 La Forge Geordi d#d.com 2
3 Crusher Beverly f#f.com 3
4 Data Data h#h.com 4
random_4
Source: local data frame [4 x 4]
Groups: Section
Last.Name First.Name Email Section
1 Picard Jean-Luc a#a.com 1
2 Riker William c#c.com 2
3 Crusher Beverly f#f.com 3
4 Data Data h#h.com 4
%>% means 'and then'
The code is read as:
Take DF AND THEN for all 'Section', select by position (slice) 1 or 2. Voila.
I suppose you have 8 students: First.Name <- c("Jean-Luc", "Deanna", "William", "Geordi", "Tasha", "Beverly", "Wesley", "Data").
If you wish to randomly assign a section number to the 8 students, and assuming you would like each section to have 2 students, then you can either permute Section <- c(1, 1, 2, 2, 3, 3, 4, 4) or permute the list of the students.
First approach, permute the sections:
> assigned_section <- print(sample(Section))
[1] 1 4 3 2 2 3 4 1
Then the following data frame gives the assignments:
assigned_students <- data.frame(First.Name, assigned_section)
Second approach, permute the students:
> assigned_students <- print(sample(First.Name))
[1] "Data" "Geordi" "Tasha" "William" "Deanna" "Beverly" "Jean-Luc" "Wesley"
Then, the following data frame gives the assignments:
assigned_students <- data.frame(assigned_students, Section)
Alex, Thank You. Your answer wasn't exactly what I was looking for, but it inspired the correct one for me. I had been thinking about the process from a far too complicated point of view. Instead of having R select rows and put them into a new data frame, I decided to have R assign a random number to each of the students and then sort the data frame by the number:
First, I broke up the data frame into sections:
df1<- subset(df, Section ==1)
df2<- subset(df, Section ==2)
df3<- subset(df, Section ==3)
df4<- subset(df, Section ==4)
Then I randomly generated a group number 1 through 4.
Groupnumber <-sample(1:4,4, replace=F)
Next, I told R to bind the columns:
Assigned1 <- cbind(df1,Groupnumber)
*Ran the group number generator and cbind in alternating order until I got through the whole set. (Wanted to make sure the order of the numbers was unique for each section).
Finally row binding the data set back together:
Final_List<-rbind(Assigned1,Assigned2,Assigned3,Assigned4)
Thank you everyone who looked this over. I am new to data science, R, and stackoverflow, but as I learn more I hope to return the favor.
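The split / assign / bind steps described above can be sketched in one pass with ave(), which applies a function within each Section and puts the results back in the original row order. This is only one way to automate it, assuming every section has the same number of students; the Group column name and the temporary output file are made up.

```r
Last.Name  <- c("Picard", "Troi", "Riker", "La Forge", "Yar", "Crusher", "Crusher", "Data")
First.Name <- c("Jean-Luc", "Deanna", "William", "Geordi", "Tasha", "Beverly", "Wesley", "Data")
Section    <- c(1, 1, 2, 2, 3, 3, 4, 4)
df <- data.frame(Last.Name, First.Name, Section, stringsAsFactors = FALSE)

set.seed(42)  # only so the sketch is reproducible

# Within each section, hand out the group numbers 1..n in random order
df$Group <- ave(seq_len(nrow(df)), df$Section,
                FUN = function(i) sample(length(i)))

# Sort by group so each block of rows is one group, then save for a spreadsheet
final_list <- df[order(df$Group, df$Section), ]
write.csv(final_list, tempfile(fileext = ".csv"), row.names = FALSE)
final_list
```

Because each section receives every group number exactly once, every group ends up with one student from each section.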
I'd suggest the randomizr package to "block assign" according to section. The block_ra function lets you do this in an easy-to-read one-liner.
install.packages("randomizr")
library(randomizr)
df$group <- block_ra(block_var = df$Section,
condition_names = c("group_1", "group_2"))
You can inspect the resulting sets in a variety of ways. Here's how with base R subsetting:
df[df$group == "group_1",]
Last.Name First.Name Email Section group
2 Troi Deanna b#b.com 1 group_1
3 Riker William c#c.com 2 group_1
6 Crusher Beverly f#f.com 3 group_1
7 Crusher Wesley g#g.com 4 group_1
df[df$group == "group_2",]
Last.Name First.Name Email Section group
1 Picard Jean-Luc a#a.com 1 group_2
4 La Forge Geordi d#d.com 2 group_2
5 Yar Tasha e#e.com 3 group_2
8 Data Data h#h.com 4 group_2
If you want to roll your own:
set <- tapply(1:nrow(df), df$Section, FUN = sample, size = 1)
df[set,] # show the sampled set
df[-set,] # show the complementary set

Resources