I have multiple files that belong together in pairs. Each pair should be summed on the values in column 2 to create one file. All files have the same rows, and the files in a pair share the same ID before the L* part of the name.
I would like to write a loop that identifies the paired files and sums them on column 2.
I have written code that reads the files, but I am not sure how to proceed:
file_list <- list.files(pattern = "\\.csv$")
library(data.table)
lst <- lapply(file_list, function(x)
fread(x, select=c("V1", "V2"))[,
list(ID=paste(V1), freq=V2)])
Two of the pairs are shown below:
Pair one:
01_001_F08_S80_L009
16S_rRNA_copy_A-1 75
16S_rRNA_copy_B-1 86
16S_rRNA_copy_C-1 102
01_001_F08_S80_L002
16S_rRNA_copy_A-1 98
16S_rRNA_copy_B-1 96
16S_rRNA_copy_C-1 101
Pair two:
01_001_F09_S81_L006
16S_rRNA_copy_A-1 242
16S_rRNA_copy_B-1 244
16S_rRNA_copy_C-1 302
01_001_F09_S81_L003
16S_rRNA_copy_A-1 252
16S_rRNA_copy_B-1 253
16S_rRNA_copy_C-1 322
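We can split 'lst' by a substring of its names (created with sub), loop through the resulting list, rbind the nested list elements, and get the sum of 'freq' grouped by 'ID'. This assumes 'lst' is named by file name without the extension, e.g. names(lst) <- sub("\\.csv$", "", file_list).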
lapply(split(lst, sub("\\d+$", "", names(lst))),
function(x) rbindlist(x)[, .(freq = sum(freq)), ID])
#$`01_001_F08_S80_L`
# ID freq
#1: 16S_rRNA_copy_A-1 173
#2: 16S_rRNA_copy_B-1 182
#3: 16S_rRNA_copy_C-1 203
#$`01_001_F09_S81_L`
# ID freq
#1: 16S_rRNA_copy_A-1 494
#2: 16S_rRNA_copy_B-1 497
#3: 16S_rRNA_copy_C-1 624
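The grouping step above can be sketched end to end as follows. This is a minimal sketch, not the answerer's exact code: the list is built in memory from the four example tables instead of being read with fread(), and the names `ids`, `sample_key` and `summed` are made up for illustration. With real files, the list names could be set from the file names with the extension removed.

```r
library(data.table)

# Hypothetical in-memory stand-ins for the tables that fread() would return;
# the list names are the file names without the ".csv" extension.
ids <- c("16S_rRNA_copy_A-1", "16S_rRNA_copy_B-1", "16S_rRNA_copy_C-1")
lst <- list(
  "01_001_F08_S80_L009" = data.table(ID = ids, freq = c(75, 86, 102)),
  "01_001_F08_S80_L002" = data.table(ID = ids, freq = c(98, 96, 101)),
  "01_001_F09_S81_L006" = data.table(ID = ids, freq = c(242, 244, 302)),
  "01_001_F09_S81_L003" = data.table(ID = ids, freq = c(252, 253, 322))
)

# Drop the lane suffix ("_L" followed by digits) to get one key per pair
sample_key <- sub("_L\\d+$", "", names(lst))

# Stack the tables of each pair and sum 'freq' within each 'ID'
summed <- lapply(split(lst, sample_key), function(x) {
  rbindlist(x)[, .(freq = sum(freq)), by = ID]
})

summed[["01_001_F08_S80"]]$freq
# 173 182 203
```

Each element of `summed` could then be written to its own file, for example with fwrite(), using the element name as the file name.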
A VERY simplified example of my dataset:
HUC8 YEAR RO_MM
1: 10010001 1961 78.2
2: 10010001 1962 84.0
3: 10010001 1963 70.2
4: 10010001 1964 130.5
5: 10010001 1965 54.3
I found this code online which sort of, but not quite, does what I want:
#create a list of the files from your target directory
file_list <- list.files(path="~/Desktop/Rprojects")
#initiate a blank data frame, each iteration of the loop will append the data from the given file to this variable
allHUCS <- data.frame()
#I want to read each .csv from a folder named "Rprojects" on my desktop into one huge dataframe for further use.
for (i in 1:length(file_list)){
temp_data <- fread(file_list[i], stringsAsFactors = F)
allHUCS <- rbindlist(list(allHUCS, temp_data), use.names = T)
}
Question: I have read that one should not use rbindlist for a large dataset:
"You should never ever ever iteratively rbind within a loop: performance might be okay in the beginning, but with each call to rbind it makes a complete copy of the data, so with each pass the total data to copy increases. It scales horribly. Consider do.call(rbind.data.frame, file_list)." – #r2evans
I know this may seem simple but I'm unclear about how to use his directive. Would I write this for the last line?
allHUCS <- do.call(rbind.data.frame(allHUCS, temp_data), use.names = T)
Or something else? In my actual data, each .csv has 2099 objects with 3 variables (but I only care about the last two.) The total dataframe should contain 47,000,000+ objects of 2 variables. When I ran the original code I got these errors:
Error in rbindlist(list(allHUCS, temp_data), use.names = T) : Item 2
has 2 columns, inconsistent with item 1 which has 3 columns. To fill
missing columns use fill=TRUE.
In addition: Warning messages: 1: In fread(file_list[i],
stringsAsFactors = F) : Detected 1 column names but the data has 2
columns (i.e. invalid file). Added 1 extra default column name for the
first column which is guessed to be row names or an index. Use
setnames() afterwards if this guess is not correct, or fix the file
write command that created the file to create a valid file.
2: In fread(file_list[i], stringsAsFactors = F) : Stopped early on
line 20. Expected 2 fields but found 3. Consider fill=TRUE and
comment.char=. First discarded non-empty line: <<# mv *.csv .. ; >>
Except for the setnames() suggestion, I don't understand what I'm being told. I know it says it stopped early, but I don't even know how to see the entire dataset or to tell where it stopped.
I'm now reading that rbindlist and rbind are two different things and rbindlist is faster than do.call(rbind, data). But the suggestion is do.call(rbind.data.frame(allHUCS, temp_data). Which is going to be fastest?
Since the original post does not include a reproducible example, here is one that reads data from the Pokémon Stats data that I maintain on Github.
First, we download a zip file containing one CSV file for each generation of Pokémon, and unzip it to the ./pokemonData subdirectory of the R working directory.
download.file("https://raw.githubusercontent.com/lgreski/pokemonData/master/PokemonData.zip",
"pokemonData.zip",
method="curl",mode="wb")
unzip("pokemonData.zip",exdir="./pokemonData")
Next, we obtain a list of files in the directory to which we unzipped the CSV files.
thePokemonFiles <- list.files("./pokemonData",
full.names=TRUE)
Finally, we load the data.table package, use lapply() with data.table::fread() to read the files, combine the resulting list of data tables with do.call(), and print the head() and tail() of the resulting data frame with all 8 generations of Pokémon stats.
library(data.table)
data <- do.call(rbind,lapply(thePokemonFiles,fread))
head(data)
tail(data)
...and the output:
> head(data)
ID Name Form Type1 Type2 Total HP Attack Defense Sp. Atk Sp. Def Speed
1: 1 Bulbasaur Grass Poison 318 45 49 49 65 65 45
2: 2 Ivysaur Grass Poison 405 60 62 63 80 80 60
3: 3 Venusaur Grass Poison 525 80 82 83 100 100 80
4: 4 Charmander Fire 309 39 52 43 60 50 65
5: 5 Charmeleon Fire 405 58 64 58 80 65 80
6: 6 Charizard Fire Flying 534 78 84 78 109 85 100
Generation
1: 1
2: 1
3: 1
4: 1
5: 1
6: 1
> tail(data)
ID Name Form Type1 Type2 Total HP Attack Defense Sp. Atk
1: 895 Regidrago Dragon 580 200 100 50 100
2: 896 Glastrier Ice 580 100 145 130 65
3: 897 Spectrier Ghost 580 100 65 60 145
4: 898 Calyrex Psychic Grass 500 100 80 80 80
5: 898 Calyrex Ice Rider Psychic Ice 680 100 165 150 85
6: 898 Calyrex Shadow Rider Psychic Ghost 680 100 85 80 165
Sp. Def Speed Generation
1: 50 80 8
2: 110 30 8
3: 80 130 8
4: 80 80 8
5: 130 50 8
6: 100 150 8
>
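For the question about the loop, the same read-then-bind-once pattern can also be written with rbindlist(), which is data.table's own way of stacking a list of tables. The sketch below is an illustration under assumptions, not a benchmark: the temporary folder, the three tiny CSV files and the HUC8 values are invented stand-ins for the real folder. The warning text in the question ("First discarded non-empty line: <<# mv *.csv .. ; >>") suggests that list.files() may have picked up a file that is not a CSV, so the sketch restricts the listing to names ending in .csv and keeps full paths.

```r
library(data.table)

# Hypothetical stand-in for the real folder: write three small CSVs to a temp directory
csv_dir <- file.path(tempdir(), "huc_csvs")
dir.create(csv_dir, showWarnings = FALSE)
for (huc in c(10010001L, 10010002L, 10010003L)) {
  fwrite(
    data.table(HUC8 = huc, YEAR = 1961:1965, RO_MM = c(78.2, 84.0, 70.2, 130.5, 54.3)),
    file.path(csv_dir, paste0(huc, ".csv"))
  )
}

# Only pick up .csv files, and keep full paths so fread() can find them
file_list <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)

# Read every file first (keeping only the two columns of interest), then bind once
allHUCS <- rbindlist(
  lapply(file_list, fread, select = c("YEAR", "RO_MM")),
  use.names = TRUE
)

dim(allHUCS)
# 15  2
```

Because the binding happens once, after all files are read, nothing is copied repeatedly inside a loop, which is the concern raised in the quoted comment.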
I have a list of lists, like so:
x <-list()
x[[1]] <- c('97', '342', '333')
x[[2]] <- c('97','555','556','742','888')
x[[3]] <- c ('100', '442', '443', '444', '445','446')
The first number in each list (97, 97, 100) refers to a node in a tree and the following numbers refer to traits associated with that node.
My goal is to create a dataframe that looks like this:
df= data.frame(node = c('97','97','97','97','97','97','100','100','100','100','100'),
trait = c('342','333','555','556','742','888','442','443','444','445','446'))
where each trait has its corresponding node.
I think the first thing I need to do is convert the list of lists into a single dataframe. I've tried doing so using:
do.call(rbind,x)
but that repeats the values in x[[1]] and x[[2]] to match the length of x[[3]]. I've also tried using:
dt_list <- map(x, as.data.table)
dt <- rbindlist(dt_list, fill = TRUE, idcol = T)
Which I think gets me closer, but I'm still unsure of how to assign the first node value to the corresponding trait values. I know this is probably a simple task but it's stumping me today!
Maybe you can try the code below
h <- sapply(x, `[`,1)
d <- lapply(x, `[`,-1)
df <- data.frame(node = rep(h,lengths(d)), trait = unlist(d))
such that
> df
node trait
1 97 342
2 97 333
3 97 555
4 97 556
5 97 742
6 97 888
7 100 442
8 100 443
9 100 444
10 100 445
11 100 446
You can create a data frame with the first value from the vector in column 'node' and the rest of the values in column 'trait'. This strategy can be applied to all entries in the list using the map_df() function from purrr package, giving the output you describe.
library(purrr)
library(dplyr)
x %>%
map_df(., function(vec) data.frame(node = vec[1],
trait = vec[-1],
stringsAsFactors = F))
An option with base R is
stack(setNames(lapply(x, `[`, -1), sapply(x, `[`, 1)))[2:1]
# ind values
#1 97 342
#2 97 333
#3 97 555
#4 97 556
#5 97 742
#6 97 888
#7 100 442
#8 100 443
#9 100 444
#10 100 445
#11 100 446
Another solution
library(tidyverse)
library(purrr)
node <- map(x, ~rep(.x[1], length(.x)-1)) %>% flatten_chr()
trait <- map(x, ~.x[2:length(.x)]) %>% flatten_chr()
out <- tibble(node, trait)
node trait
<chr> <chr>
1 97 342
2 97 333
3 97 555
4 97 556
5 97 742
6 97 888
7 100 442
8 100 443
9 100 444
10 100 445
11 100 446
I have a df RawDat with two columns, ID and data. I want to grep() my data by the id using e.g. lapply() to generate a new df where the data is sorted into columns by their id:
My df looks like this, except I have >80000 rows, and 75 ids:
ID data
abl 564
dlh 78
vho 354
mez 15
abl 662
dlh 69
vho 333
mez 9
.
.
.
I can manually extract the data using the grep() function:
ExtRawDat = as.data.frame(RawDat[grep("abl",RawDat$ID),])
However, I would not want to do that 75 times and cbind() the results. Rather, I would like to use the lapply() function to automate it. I have tried several variations of the following code, but none of them provides the desired output.
I have a vector with the 75 ids ProLisV, to loop my argument
ExtRawDat = as.data.frame(lapply(ProLisV[1:75],function(x){
Temp1 = RawDat[grep(x,RawDat$ID),] # The issue is here, the pattern is not properly defined with the X input (is it detrimental that some of the names in the list having spaces etc.?)
Values = as.data.frame(Temp1$data)
list(Values$data)
}))
The desired output looks like this:
abl dlh vho mez ...
564 78 354 15
662 69 333 9
.
.
.
How do I adjust that function to provide the desired output? Thank you.
It looks like what you are trying to do is to convert your data from long form to wide form. One way to do this easily is to use the spread function from the tidyr package. To use it, we need a column to remove duplicate identifiers, so we'll first add a grouping variable:
n.ids <- 4 # With your full data this should be 75
# One group per complete cycle of IDs; assumes the IDs repeat in blocks of n.ids rows
df$group <- ceiling(seq_len(nrow(df)) / n.ids)
tidyr::spread(df, ID, data)
# group abl dlh mez vho
# 1 1 564 78 15 354
# 2 2 662 69 9 333
If you don't want the group column in the result, assign the output (e.g. wide <- tidyr::spread(df, ID, data)) and then drop the column with wide$group <- NULL.
Data
df <- read.table(text = "
ID data
abl 564
dlh 78
vho 354
mez 15
abl 662
dlh 69
vho 333
mez 9", header = T)
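If an extra package is not wanted, the same long-to-wide reshaping can be sketched in base R with unstack(). This is an alternative to the spread() approach, not a rewrite of it, and it assumes every ID occurs the same number of times and that the rows of each ID are already in the desired order. The name `wide` is made up for illustration.

```r
# Same example data as above
df <- read.table(text = "
ID data
abl 564
dlh 78
vho 354
mez 15
abl 662
dlh 69
vho 333
mez 9", header = TRUE)

# One column per ID; values keep their original order within each ID
wide <- unstack(df, data ~ ID)
wide
#   abl dlh mez vho
# 1 564  78  15 354
# 2 662  69   9 333
```

No helper grouping column is needed here, because unstack() simply collects the values of each ID in the order they appear.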
I would like to understand how to subset multiple columns from same data frame by matching the first 5 letters of the column names with each other and if they are equal then subset it and store it in a new variable.
Here is a small illustration of the output I need.
Let's say the data frame is eatable:
fruits_area fruits_production vegetables_area vegetable_production
12 100 26 324
33 250 40 580
660 510 43 581
eatable <- data.frame(c(12,33,660),c(100,250,510),c(26,40,43),c(324,580,581))
names(eatable) <- c("fruits_area", "fruits_production", "vegetables_area",
"vegetable_production")
I was trying to write a function which will match the strings in a loop and will store the subset columns after matching first 5 letters from the column names.
checkExpression <- function(dataset,str){
dataset[grepl((str),names(dataset),ignore.case = TRUE)]
}
checkExpression(eatable,"your_string")
The above function checks the string correctly but I am confused how to do matching among the column names in the dataset.
Edit:- I think regular expressions would work here.
You could try:
v <- unique(substr(names(eatable), 0, 5))
lapply(v, function(x) eatable[grepl(x, names(eatable))])
Or using map() + select() (select_() is deprecated in current dplyr):
library(tidyverse)
map(v, ~select(eatable, matches(.x)))
Which gives:
#[[1]]
# fruits_area fruits_production
#1 12 100
#2 33 250
#3 660 510
#
#[[2]]
# vegetables_area vegetable_production
#1 26 324
#2 40 580
#3 43 581
Should you want to make it into a function:
checkExpression <- function(df, l = 5) {
v <- unique(substr(names(df), 0, l))
lapply(v, function(x) df[grepl(x, names(df))])
}
Then simply use:
checkExpression(eatable, 5)
I believe this may address your needs:
checkExpression <- function(dataset,str){
cols <- grepl(paste0("^",str),colnames(dataset),ignore.case = TRUE)
subset(dataset,select=colnames(dataset)[cols])
}
Note the addition of "^" to the pattern used in grepl.
Using your data:
checkExpression(eatable,"fruit")
## fruits_area fruits_production
##1 12 100
##2 33 250
##3 660 510
checkExpression(eatable,"veget")
## vegetables_area vegetable_production
##1 26 324
##2 40 580
##3 43 581
Your function does exactly what you want but there was a small error:
checkExpression <- function(dataset,str){
dataset[grepl((str),names(dataset),ignore.case = TRUE)]
}
Change the name of the object you are subsetting from obje to dataset.
checkExpression(eatable,"fr")
# fruits_area fruits_production
#1 12 100
#2 33 250
#3 660 510
checkExpression(eatable,"veg")
# vegetables_area vegetable_production
#1 26 324
#2 40 580
#3 43 581
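The prefix-matching idea running through these answers can also be sketched without any pattern matching, by splitting the columns directly on the first five characters of their names. This is a base R sketch, not one of the answers above; `prefix` and `groups` are names made up for illustration.

```r
eatable <- data.frame(c(12, 33, 660), c(100, 250, 510), c(26, 40, 43), c(324, 580, 581))
names(eatable) <- c("fruits_area", "fruits_production", "vegetables_area",
                    "vegetable_production")

# First five characters of every column name: "fruit", "fruit", "veget", "veget"
prefix <- substr(names(eatable), 1, 5)

# split.default() treats the data frame as a list of columns and groups them by prefix
groups <- split.default(eatable, prefix)

names(groups)
# "fruit" "veget"
names(groups$veget)
# "vegetables_area" "vegetable_production"
```

The result is a named list of data frames, so a group can be pulled out by its prefix, e.g. groups$fruit, instead of by position.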
I am trying to edit a column in my data frame. I tried tstrsplit but didn't get the desired result. I want to remove the ';' from OID and have a single value in every row of the OID column.
This is the code I tried:
library(data.table)
setDT(df)[, paste0("OID", 1:3) := tstrsplit(OID, ";", fixed = TRUE)]
This created 3 different columns, OID1, OID2 and OID3, but I only need to edit the OID column so that it holds single values, as shown in my desired output below.
Here is my data:
QID OID
189 204;202;201;203;
189 202;203;201;204;
189 na
189 204;202;201;203;
189 na
189 204;202;201;203;
189 na
My desired output is below:
QID OID
189 202
189 201
189 204
189 203
If we need a single element from each row, we can split the 'OID' by ;, loop through the list output with sapply, pick a single element with sample (as the selection rule is not clear), and update the 'OID' with that output.
transform(df, OID = sapply(strsplit(OID, ";"), sample, 1))
# QID OID
#1 189 202
#2 189 204
#3 189 203
#4 189 202
If we need unique values per row
transform(df, OID = sample(unique(unlist(strsplit(OID, ";")))))
# QID OID
#1 189 202
#2 189 201
#3 189 203
#4 189 204
NOTE: If the "OID" column class is factor, convert to character class before splitting i.e. strsplit(as.character(OID), ";")
data
df <- structure(list(QID = c(189L, 189L, 189L, 189L),
OID = c("204;202;201;203;",
"202;203;201;204;", "204;202;201;203;", "204;202;201;203;")),
.Names = c("QID", "OID"), class = "data.frame", row.names = c(NA, -4L))
I think another option is stringr::str_split_fixed. It is vectorised over the string, so it should be more efficient than sapply.
str_split_fixed(string, pattern, n)
Please see here: http://www.inside-r.org/packages/cran/stringr/docs/str_split_fixed
df <- data.frame(QID=c(189,189,189,189),
OID=c("204;202;201;203","202;203;201;204",
"204;202;201;203","204;202;201;203"))
df
# QID OID
# 1 189 204;202;201;203
# 2 189 202;203;201;204
# 3 189 204;202;201;203
# 4 189 204;202;201;203
library(stringr)
df$OID = str_split_fixed(df$OID, ";",4)[,1] #get the first separated column
df
# QID OID
#1 189 204
#2 189 202
#3 189 204
#4 189 204
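Another reading of the desired output is one row per distinct OID value, rather than one value picked per original row. The base R sketch below follows that reading. It is an interpretation, not a confirmed requirement: it assumes the literal string "na" marks a missing entry that should be dropped, and `parts` and `long` are names made up for illustration.

```r
# A few rows in the shape of the question's data, including an "na" entry
df <- data.frame(QID = 189L,
                 OID = c("204;202;201;203;", "na", "202;203;201;204;"),
                 stringsAsFactors = FALSE)

# Split each OID string on ";" (strsplit() ignores the trailing separator)
parts <- strsplit(df$OID, ";", fixed = TRUE)

# One row per value, repeating QID as often as its row had values
long <- data.frame(QID = rep(df$QID, lengths(parts)),
                   OID = unlist(parts),
                   stringsAsFactors = FALSE)

# Drop the "na" placeholder and keep each QID/OID combination once
long <- unique(long[long$OID != "na", ])
long
#   QID OID
# 1 189 204
# 2 189 202
# 3 189 201
# 4 189 203
```

The values appear in the order they are first seen, so an explicit sort would be needed if a particular order matters.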