I'm importing several .csv files that are all two columns wide (they're output from a program): the first column is wavelength and the second is absorbance. I'm naming the absorbance column after the file name so the files can be combined later, following this old Stack Overflow answer (Combining csv files in R to different columns). The incoming .csv files don't have headers, and I'm aware that the way I'm naming them crops the first data points. I would like the first column to have no decimals and to have all of its numbers standardized to four digits; the code I've added works on its own but not in this block, and I would prefer to do this formatting all in one go. I run into errors about $ not being the right operator, but when I use [] I get errors about that too. The column I need to format is the first one, named 'Wavelength', which also gives me errors, either that 'Wavelength' doesn't exist or that it's non-numeric. Any ideas?
This is what my script currently looks like:
for (file in file_list) {
  f <- sub("(.*)\\.CSV", "\\1", file)
  assign(f, read.csv(file = file))
  assign(f, setNames(get(f), c(names(get(f))[0:0], "Wavelength")))
  assign(f, setNames(get(f), c(names(get(f))[1:1], file)))
  floor(f[Wavelength])          # the issues are here
  sprintf("%04d", f$Wavelength) # and here
}
The data looks like this in the csv before it gets processed:
1 401.7664 0.1379457
2 403.8058 0.1390427
3 405.8452 0.1421666
4 407.8847 0.1463629
5 409.9241 0.1477264
I would like the output to be:
Wavelength (file name)
1 0401 0.1379457
2 0403 0.1390427
3 0405 0.1421666
4 0407 0.1463629
5 0409 0.1477264
And here's the dput that r2evans asked for:
structure(list(X3.997270e.002 = c(401.7664, 403.8058, 405.8452,
    407.8847, 409.9241, 411.9635), X1.393858e.001 = c(0.1379457,
    0.1390427, 0.1421666, 0.1463629, 0.1477264, 0.1476971)),
    row.names = c(NA, 6L), class = "data.frame")
Thanks in advance!
6/24 Update:
When I assign the column name "Wavelength", it only gets added as a character, not as a real column name. When I dput/head the files after they go through the loop (omitting the sprintf/floor functions), only the file name (the second column) is listed. When I open the csvs in RStudio the first column is properly labeled, and furthermore I'm able to combine all the csvs sorted by "Wavelength":
list_csvs <- mget(sub("(.*)\\.CSV", "\\1", file_list))
all_csvs <- Reduce(function(x, y) merge(x, y, all = TRUE, by = "Wavelength"),
                   list_csvs, accumulate = FALSE)
Naturally I've thought about just formatting the column after this step, but some of the values differ in the thousandths place, so I do need to format before I merge the csvs.
I've updated the code to use colnames outside of the read.csv:
for (file in file_list) {
  f <- sub("(.*)\\.CSV", "\\1", file)
  assign(f, read.csv(file = file,
                     header = FALSE,
                     row.names = NULL))
  colnames(f) <- c("Wavelength", file)
  print(summary(f))
  print(names(f))
  # floor("Wavelength")                # I'm omitting this to see the console errors
  # sprintf("%04.0f", f["Wavelength"]) # omitting this too
}
but I get the following error:
attempt to set 'colnames' on an object with less than two dimensions
Without the naming line and without the sprintf/floor, the summary() and names() calls return this for each file:
Length Class Mode
1 character character
NULL
When I try to call out the first column by f[1], f[[1]], f[,1], or f[[,1]] I get error messages about 'incorrect number of dimensions'. I can clearly see in the R environment that each data frame has a length of 2. I also double checked with .row_names_info(f) that the first column isn't being read as row names. What am I doing wrong?
I'm going to suggest a dplyr/tidyr pipe for this.
First, data-setup:
writeLines(
"401.7664,0.1379457
403.8058,0.1390427
405.8452,0.1421666
407.8847,0.1463629
409.9241,0.1477264", "sample1.csv")
file.copy("sample1.csv", "sample2.csv")
file_list <- normalizePath(list.files(pattern = ".*\\.csv$", full.names = TRUE), winslash = "/")
file_list
# [1] "C:/Users/r2/StackOverflow/13765634/sample1.csv"
# [2] "C:/Users/r2/StackOverflow/13765634/sample2.csv"
Now, I'm going to suggest a slightly different format: not naming a column for the filename. I like this because I still preserve the filename with the data (as a category, so to speak), and it allows you to combine all of your data into one frame for more efficient processing:
library(dplyr)
library(purrr) # map*
library(tidyr) # pivot_wider
file_list %>%
set_names(.) %>%
# set_names(tools::file_path_sans_ext(basename(.))) %>%
map_dfr(~ read.csv(.x, header = FALSE, col.names = c("freq", "val")),
.id = "filename") %>%
mutate(freq = sprintf("%04.0f", freq))
# filename freq val
# 1 C:/Users/r2/StackOverflow/13765634/sample1.csv 0402 0.1379457
# 2 C:/Users/r2/StackOverflow/13765634/sample1.csv 0404 0.1390427
# 3 C:/Users/r2/StackOverflow/13765634/sample1.csv 0406 0.1421666
# 4 C:/Users/r2/StackOverflow/13765634/sample1.csv 0408 0.1463629
# 5 C:/Users/r2/StackOverflow/13765634/sample1.csv 0410 0.1477264
# 6 C:/Users/r2/StackOverflow/13765634/sample2.csv 0402 0.1379457
# 7 C:/Users/r2/StackOverflow/13765634/sample2.csv 0404 0.1390427
# 8 C:/Users/r2/StackOverflow/13765634/sample2.csv 0406 0.1421666
# 9 C:/Users/r2/StackOverflow/13765634/sample2.csv 0408 0.1463629
# 10 C:/Users/r2/StackOverflow/13765634/sample2.csv 0410 0.1477264
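(One note: the %04.0f format rounds to the nearest integer, which is why 401.7664 prints as 0402. If you need truncation instead, as in your desired 0401, use sprintf("%04.0f", floor(freq)); the rest of the pipe is unchanged.)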
Options: if you prefer just the filename (no path) and are certain that there is no filename collision, then use set_names(basename(.)) instead. (This step is really necessary when using the filename as a column name anyway.) I'll also remove the file extension, since they're likely all .csv or similar.
file_list %>%
# set_names(.) %>%
set_names(tools::file_path_sans_ext(basename(.))) %>%
map_dfr(~ read.csv(.x, header = FALSE, col.names = c("freq", "val")),
.id = "filename") %>%
mutate(freq = sprintf("%04.0f", freq))
# filename freq val
# 1 sample1 0402 0.1379457
# 2 sample1 0404 0.1390427
# 3 sample1 0406 0.1421666
# 4 sample1 0408 0.1463629
# 5 sample1 0410 0.1477264
# 6 sample2 0402 0.1379457
# 7 sample2 0404 0.1390427
# 8 sample2 0406 0.1421666
# 9 sample2 0408 0.1463629
# 10 sample2 0410 0.1477264
(If you need to do something to each dataset at a time, then you should use %>% group_by(filename), not sure if that's relevant.)
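For instance, a minimal sketch of that, with a hypothetical per-file rescaling standing in for whatever per-dataset step you might need:
file_list %>%
  set_names(tools::file_path_sans_ext(basename(.))) %>%
  map_dfr(~ read.csv(.x, header = FALSE, col.names = c("freq", "val")),
          .id = "filename") %>%
  group_by(filename) %>%                  # one group per input file
  mutate(val_scaled = val / max(val)) %>% # hypothetical per-file operation
  ungroup()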
If you really need the filename to be the column name of the value, then modify this slightly so that it preserves it as a list:
file_list %>%
set_names(tools::file_path_sans_ext(basename(.))) %>%
map(~ read.csv(.x, header = FALSE, col.names = c("freq", "val"))) %>%
map2(., names(.), ~ transmute(.x, freq = sprintf("%04.0f", freq), !!.y := val))
# $sample1
# freq sample1
# 1 0402 0.1379457
# 2 0404 0.1390427
# 3 0406 0.1421666
# 4 0408 0.1463629
# 5 0410 0.1477264
# $sample2
# freq sample2
# 1 0402 0.1379457
# 2 0404 0.1390427
# 3 0406 0.1421666
# 4 0408 0.1463629
# 5 0410 0.1477264
But I'm going to infer that ultimately you want to combine these column-wise, assuming there will be alignment in the freq column. (I can't think of another reason why you'd want the column name to be the filename.)
For that, try this, reverting to the first use of map_dfr, introducing pivot_wider:
file_list %>%
set_names(tools::file_path_sans_ext(basename(.))) %>%
map_dfr(~ read.csv(.x, header = FALSE, col.names = c("freq", "val")),
.id = "filename") %>%
mutate(freq = sprintf("%04.0f", freq)) %>%
pivot_wider(freq, names_from = filename, values_from = val)
# # A tibble: 5 x 3
# freq sample1 sample2
# <chr> <dbl> <dbl>
# 1 0402 0.138 0.138
# 2 0404 0.139 0.139
# 3 0406 0.142 0.142
# 4 0408 0.146 0.146
# 5 0410 0.148 0.148
Notes (perhaps more of a soap-box):
Regarding your use of assign, I strongly discourage this behavior. Since the data is effectively all structured the same, I infer that you'll be doing the same thing to each of these files. In that case, it is much better to use one of the *apply functions on a list of data.frames. That is, instead of having to iterate over a list of variable names, get each one, do something, then reassign it, it is often much easier (to program, to read, to maintain) to write dats <- lapply(dats, some_function) or dats2 <- lapply(dats, function(x) { ...; x }).
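As a minimal sketch of that pattern (reusing file_list from above; the "val" column name and the floor/format step are my guesses at your intended formatting):
# read every file into one named list of data.frames
dats <- lapply(file_list, read.csv, header = FALSE, col.names = c("Wavelength", "val"))
names(dats) <- tools::file_path_sans_ext(basename(file_list))
# apply the same transformation to every frame; no get()/assign() juggling
dats <- lapply(dats, function(x) {
  x$Wavelength <- sprintf("%04.0f", floor(x$Wavelength))
  x
})
Incidentally, this also sidesteps the error in your updated loop: there, f holds the file name (a length-1 character vector), not the data frame stored under that name, so colnames(f) <- ... and summary(f) were operating on a string. You would have needed to work through get(f)/assign(f, ...), which is exactly the awkwardness a list avoids.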
Regarding the use of filename-as-column-name. Some tools (e.g., ggplot2) really benefit from having "long" data (i.e., one or more category columns such as filename, and one column for each type of data ... type is relative to your understanding of the data). You might benefit from reframing your thinking on working with this data.
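For example, assuming the combined long frame from the map_dfr pipe above were saved as all_data (a hypothetical name), plotting every file's spectrum is one call:
library(ggplot2)
# one line per file; freq is character after sprintf, so group/color by filename
ggplot(all_data, aes(x = freq, y = val, color = filename, group = filename)) +
  geom_line()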
Related
I'm trying to make objects directly from information listed in a tibble that can be called on by later functions/tibbles in my environment. I can make the objects manually but I'm working to do this iteratively.
library(tidyverse)
##determine mean from 2x OD Negatives in experimental plates, then save summary for use in appending table
ELISA_negatives = "my_file.csv"
neg_tibble <- as_tibble(read_csv(ELISA_negatives, col_names = TRUE)) %>%
group_by(Species_ab, Antibody, Protein) %>%
filter(str_detect(Animal_ID, "2x.*")) %>%
summarize(ave_neg_U_mL = mean(U_mL, na.rm = TRUE), n=sum(!is.na(U_mL)))
neg_tibble
# A tibble: 4 x 5
# Groups: Species_ab, Antibody [2]
Species_ab Antibody Protein ave_neg_U_mL n
<chr> <chr> <chr> <dbl> <int>
1 Mouse IgG GP 28.2 6
2 Mouse IgG NP 45.9 6
3 Rat IgG GP 5.24 4
4 Rat IgG NP 1.41 1
I can write the object manually based off the above tibble:
Mouse_IgG_GP_cutoff <- as.numeric(neg_tibble[1,4])
Mouse_IgG_GP_cutoff
[1] 28.20336
In my attempt to do this iteratively, I can make a new tibble neg_tibble_string with the information I need. All I would need to do now is make a global object from the Name in the first column Test_Name, and assign it to the numeric value in the second column ave_neg_U_mL (which is where I'm getting stuck).
neg_tibble_string <- neg_tibble %>%
select(Species_ab:Protein) %>%
unite(col='Test_Name', c('Species_ab', 'Antibody', 'Protein'), sep = "_") %>%
mutate(Test_Name = str_c(Test_Name, "_cutoff")) %>%
bind_cols(neg_tibble[4])
neg_tibble_string
# A tibble: 4 x 2
Test_Name ave_neg_U_mL
<chr> <dbl>
1 Mouse_IgG_GP_cutoff 28.2
2 Mouse_IgG_NP_cutoff 45.9
3 Rat_IgG_GP_cutoff 5.24
4 Rat_IgG_NP_cutoff 1.41
I feel like there has to be a way to get this from the above tibble neg_tibble_string, and to make it work for all four of the rows. I've tried a variant of this and this, but can't get anywhere.
> list_df <- mget(ls(pattern = "neg_tibble_string"))
> list_output <- map(list_df, ~neg_tibble_string$ave_neg_U_mL)
Warning message:
Unknown or uninitialised column: `ave_neg_U_mL`.
> list_output
$neg_tibble_string
NULL
As always, any insight is appreciated! I'm making progress on my R journey but I know I am missing large gaps in knowledge.
As we already returned the object value in a list, we only need to specify the lambda function, i.e. .x returns the value of the list element (which is a tibble), and we extract the column:
library(purrr)
list_output <- map(list_df, ~ .x$ave_neg_U_mL)
If the intention is to create global objects, deframe() the tibble, convert it to a list, and then use list2env():
library(tibble)
list2env(as.list(deframe(neg_tibble_string)), .GlobalEnv)
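deframe() turns the two-column tibble into a named vector (names from Test_Name, values from ave_neg_U_mL), so after the list2env() call each cutoff exists as its own global object:
Mouse_IgG_GP_cutoff
# [1] 28.20336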
I am not great with tidyverse so forgive me if this is a simple question. I have a bunch of files with data that I need to extract and add into distinct columns in a tibble I created.
I want the row names to start with the file IDs, which I did manage to create:
filelist <- list.files(pattern=".txt") # Gives me the filenames in current directory.
# The filenames are something like AA1230.report.txt for example
file_ID <- trimws(filelist, whitespace="\\..*") # Gives me the ID which is before the "report.txt"
metadata <- as_tibble(file_ID[1:181]) # create dataframe with IDs as row names for 180 files.
Now, these report files contain information on species and abundance (kraken report files, for those familiar with kraken), and all I need is to extract the number of reads for each domain. I can easily search each file for the domains and the number of reads that fall into each domain using something like:
sample_data <- as_tibble(read.table("AA1230.report.txt", sep="\t", header=FALSE, strip.white=TRUE))
sample_data <- rename(sample_data, Percentage=V1, Num_reads_root=V2, Num_reads_taxon=V3, Rank=V4, NCBI_ID=V5, Name=V6) # Just renaming the column headers for clarity
sample_data %>% filter(Rank=="D") # D for domain
This gives me a clear output such as:
Percentage Num_Reads_Root Num_Reads_Taxon Rank NCBI_ID Name
<dbl> <int> <int> <fct> <int> <fct>
1 75.9 60533 28 D 2 Bacteria
2 0.48 386 0 D 2759 Eukaryota
3 0.01 4 0 D 2157 Archaea
4 0.02 19 0 D 10239 Viruses
Now, I want to just grab the info in the second column and final column and save this info into my tibble so that I can get something like:
> metadata
value Bacteria_Counts Eukaryota_Counts Viruses_Counts Archaea_Counts
<chr> <int> <int> <int> <int>
1 AA1230 60533 386 19 4
2 AB0566
3 AA1231
4 AB0567
5 BC1148
6 AW0001
7 AW0002
8 BB1121
9 BC0001
10 BC0002
....with 171 more rows
I'm just having trouble coming up with a for loop to create these sample_data outputs and then, from those, extract the info and place it into a tibble. I guess my first loop should create the sample_data outputs with something like:
for (files in file.list()) {
>> get_domains <<
}
Then another loop to extract that info from the above loop and insert it into my metadata tibble.
Any suggestions? Thank you so much!
PS: If regular dataframes in R are better for this, let me know. I have just recently learned that the tidyverse is a better way to organize dataframes in R, but I have to learn more about it.
You could also do:
library(tidyverse)
filelist <- list.files(pattern=".txt")
nms <- c("Percentage", "Num_reads_root", "Num_reads_taxon", "Rank", "NCBI_ID", "Name")
set_names(filelist,filelist) %>%
map_dfr(read_table, col_names = nms, .id = 'file_ID') %>%
filter(Rank == 'D') %>%
select(file_ID, Name, Num_reads_root) %>%
pivot_wider(id_cols = file_ID, names_from = Name, values_from = Num_reads_root) %>%
mutate(file_ID = str_remove(file_ID, '.txt'))
I've found that using a for loop is nice sometimes because it saves all the progress along the way in case you hit an error. Then you can find the problem file and debug it, or wrap the read in try() and throw a warning().
library(tidyverse)
filelist <- list.files(pattern=".txt") #list files
tmp_list <- list()
for (i in seq_along(filelist)) {
  my_table <- read_tsv(filelist[i], col_names = FALSE) %>% # your files look like headerless .tsv's
    rename(Percentage = X1, Num_reads_root = X2, Num_reads_taxon = X3,
           Rank = X4, NCBI_ID = X5, Name = X6) %>% # read_tsv names headerless columns X1..X6
    filter(Rank == "D") %>%
    mutate(file_ID = trimws(filelist[i], whitespace = "\\..*")) %>% # '=' (not '<-') inside mutate
    select(file_ID, everything())
  tmp_list[[i]] <- my_table
}
out <- bind_rows(tmp_list)
out
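If some files might be malformed, a sketch of the try()/warning() idea mentioned above:
for (i in seq_along(filelist)) {
  res <- try(read_tsv(filelist[i], col_names = FALSE), silent = TRUE)
  if (inherits(res, "try-error")) {
    warning("could not read ", filelist[i])
    next # skip the problem file, keep the rest
  }
  tmp_list[[i]] <- res # then rename/filter/mutate as above
}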
I am a beginner in R. I need to translate some EViews code to R. In EViews, there is loop code that adds 10 or more columns/variables, each computed with some function of the data.
Here is the EViews example code to estimate deflators:
for %x exp con gov inv cap ex im
frml def_{%x} = gdp_{%x}/gdp_{%x}_r*100
next
I used the dplyr package with its mutate function, but it is very tedious to add many variables this way.
library(dplyr)
nominal_gdp<-rnorm(4)
nominal_inv<-rnorm(4)
nominal_gov<-rnorm(4)
nominal_exp<-rnorm(4)
real_gdp<-rnorm(4)
real_inv<-rnorm(4)
real_gov<-rnorm(4)
real_exp<-rnorm(4)
df<-data.frame(nominal_gdp,nominal_inv,
nominal_gov,nominal_exp,real_gdp,real_inv,real_gov,real_exp)
df<-df %>% mutate(deflator_gdp=nominal_gdp/real_gdp*100,
deflator_inv=nominal_inv/real_inv,
deflator_gov=nominal_gov/real_gov,
deflator_exp=nominal_exp/real_exp)
print(df)
Please help me do this in R with a loop.
The answer is that your data is not as "tidy" as it could be.
This is what you have (with an added observation ID for clarity):
library(dplyr)
df <- data.frame(nominal_gdp = rnorm(4),
nominal_inv = rnorm(4),
nominal_gov = rnorm(4),
real_gdp = rnorm(4),
real_inv = rnorm(4),
real_gov = rnorm(4))
df <- df %>%
mutate(obs_id = 1:n()) %>%
select(obs_id, everything())
which gives:
obs_id nominal_gdp nominal_inv nominal_gov real_gdp real_inv real_gov
1 1 -0.9692060 -1.5223055 -0.26966202 0.49057546 2.3253066 0.8761837
2 2 1.2696927 1.2591910 0.04238958 -1.51398652 -0.7209661 0.3021453
3 3 0.8415725 -0.1728212 0.98846942 -0.58743294 -0.7256786 0.5649908
4 4 -0.8235101 1.0500614 -0.49308092 0.04820723 -2.0697008 1.2478635
Consider if you had instead, in df2:
obs_id variable real nominal
1 1 gdp 0.49057546 -0.96920602
2 2 gdp -1.51398652 1.26969267
3 3 gdp -0.58743294 0.84157254
4 4 gdp 0.04820723 -0.82351006
5 1 inv 2.32530662 -1.52230550
6 2 inv -0.72096614 1.25919100
7 3 inv -0.72567857 -0.17282123
8 4 inv -2.06970078 1.05006136
9 1 gov 0.87618366 -0.26966202
10 2 gov 0.30214534 0.04238958
11 3 gov 0.56499079 0.98846942
12 4 gov 1.24786355 -0.49308092
Then what you want to do is trivial:
df2 %>% mutate(deflator = real / nominal)
obs_id variable real nominal deflator
1 1 gdp 0.49057546 -0.96920602 -0.50616221
2 2 gdp -1.51398652 1.26969267 -1.19240392
3 3 gdp -0.58743294 0.84157254 -0.69801819
4 4 gdp 0.04820723 -0.82351006 -0.05853872
5 1 inv 2.32530662 -1.52230550 -1.52749012
6 2 inv -0.72096614 1.25919100 -0.57256297
7 3 inv -0.72567857 -0.17282123 4.19901294
8 4 inv -2.06970078 1.05006136 -1.97102841
9 1 gov 0.87618366 -0.26966202 -3.24919196
10 2 gov 0.30214534 0.04238958 7.12782060
11 3 gov 0.56499079 0.98846942 0.57158146
12 4 gov 1.24786355 -0.49308092 -2.53074800
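(One caveat: your EViews formula and your own mutate example compute nominal / real * 100, while the mutate above computes real / nominal. Flip the ratio, i.e. mutate(deflator = nominal / real * 100), to match; the point about the data shape is the same either way.)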
So the question becomes: how do we get to the nice dplyr-compatible data.frame?
You need to gather your data using tidyr::gather. However, because you have two sets of variables to gather (the real and nominal values), it is not straightforward. I have done it in two steps; there may be a better way, though.
real_vals <- df %>%
select(obs_id, starts_with("real")) %>%
# the line below is where the magic happens
tidyr::gather(variable, real, starts_with("real")) %>%
# extracting the variable name (by erasing up to the underscore)
mutate(variable = gsub(variable, pattern = ".*_", replacement = ""))
# Same thing for nominal values
nominal_vals <- df %>%
select(obs_id, starts_with("nominal")) %>%
tidyr::gather(variable, nominal, starts_with("nominal")) %>%
mutate(variable = gsub(variable, pattern = ".*_", replacement = ""))
# Merging them... Now we have something we can work with!
df2 <-
full_join(real_vals, nominal_vals, by = c("obs_id", "variable"))
Note the importance of the observation id when merging.
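For what it's worth, newer tidyr (>= 1.0.0) can do both gathers in one step with pivot_longer and its ".value" sentinel; a sketch:
library(tidyr)
df2 <- df %>%
  pivot_longer(-obs_id,
               names_to = c(".value", "variable"),
               names_sep = "_")
Here names_sep = "_" splits a name like nominal_gdp into the value-column name (nominal) and the variable label (gdp).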
We can grep the matching names, and sort:
x <- colnames(df)
df[ sort(x[ (grepl("^nominal", x)) ]) ] /
df[ sort(x[ (grepl("^real", x)) ]) ] * 100
Similarly, if the columns were sorted, then we could just:
df[ 1:4 ] / df[ 5:8 ] * 100
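To keep those results alongside the originals under deflator_ names, one sketch building on the x defined above:
nom <- sort(x[ grepl("^nominal", x) ])
rea <- sort(x[ grepl("^real", x) ])
deflators <- df[nom] / df[rea] * 100
names(deflators) <- sub("^nominal", "deflator", nom)
df <- cbind(df, deflators)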
We can loop over the column names using purrr::map_dfc, then apply a custom function over the selected columns (i.e. the columns that match the current name from nms).
library(dplyr)
library(purrr)
#Replace anything before _ with empty string
nms <- unique(sub('.*_','',names(df)))
#Use map if you need the output as a list, not a dataframe
map_dfc(nms, ~deflator_fun(df, .x))
Custom function
deflator_fun <- function(df, x){
  # browser()
  nx <- paste0('nominal_', x)
  rx <- paste0('real_', x)
  select(df, matches(x)) %>%
    mutate(!!paste0('deflator_', x) := !!sym(nx) / !!sym(rx) * 100)
}
#Test
deflator_fun(df, 'gdp')
nominal_gdp real_gdp deflator_gdp
1 -0.3332074 0.181303480 -183.78433
2 -1.0185754 -0.138891362 733.36121
3 -1.0717912 0.005764186 -18593.97398
4 0.3035286 0.385280401 78.78123
Note: learn more about !!, sym, and :=, which are tools for programming with dplyr, here
I am using an ArrayExpress dataset to build a dataframe, so that I can run it in gene pattern.
In my folder, GSE11000, there is a bunch of files whose names follow this pattern:
GSM123445_samples_table.txt
GSM129995_samples_table.txt
Inside each file, the table is in this pattern
Identifier VALUE
10001 0.12323
10002 0.11535
I have a dataframe, clinical_data, that includes all the files I want, in this pattern:
Data.File Samples OS.event
1 GSM123445_samples_table.txt GSM123445 0
2 GSM129995_samples_table.txt GSM129995 0
3 GSM129999_samples_table.txt GSM129999 1
4 GSM130095_samples_table.txt GSM130095 1
I want to create a dataframe which should look like this:
Identifier GSM123445 GSM129995 GSM129999 GSM130095
1 10001 0.12323 0.14523 0.22387 0.56233
2 10002 0.11535 0.39048 0.23437 -0.12323
3 10006 0.12323 0.35634 0.12237 -0.12889
4 10008 0.11535 0.23454 0.21227 0.90098
This is my code
library(dplyr)
setwd(.../GSE11000)
file_list <- clinical_data[, 1] # create a list that include Data.File
for (file in file_list){
  if (!exists("dataset")){ # if dataset does not exist, create one
    dataset <- read.table(file, header=TRUE, sep="\t") # read txt file from folder
    x <- unlist(strsplit(file, "_"))[1] # extract the GSMxxxxxx from the file name
    dataset <- rename(dataset, x = VALUE) # rename the column
  } else {
    temp_dataset <- read.table(file, header=TRUE, sep="\t") # read file
    x <- unlist(strsplit(file, "_"))[1]
    temp_dataset <- rename(temp_dataset, x = VALUE)
    dataset <- left_join(dataset, temp_dataset, "Reporter.Identifier")
    rm(temp_dataset)
  }
}
My outcome is this
Identifier x.x x.y x.x x.y
1 10001 0.12323 0.14523 0.22387 0.56233
2 10002 0.11535 0.39048 0.23437 -0.12323
3 10006 0.12323 0.35634 0.12237 -0.12889
4 10008 0.11535 0.23454 0.21227 0.90098
This is because the rename part failed to work.
Does anyone have any idea how I can solve this problem? And can anyone make my code more efficient?
If you can tell me how to use Bioconductor to work with this data, I will be grateful too.
Similar to #jdobres but using dplyr (and spread):
First, to create some sample data files:
set.seed(42)
for (fname in sprintf("GSM%s_samples_table.txt", sample(10000, size = 4))) {
write.table(data.frame(Identifier = 10001:10004, VALUE = runif(4)),
file = fname, row.names = FALSE)
}
file_list <- list.files(pattern = "GSM.*")
file_list
# [1] "GSM2861_samples_table.txt" "GSM8302_samples_table.txt"
# [3] "GSM9149_samples_table.txt" "GSM9370_samples_table.txt"
read.table(file_list[1], skip = 1, col.names = c("Identifier", "VALUE"))
# Identifier VALUE
# 1 10001 0.9346722
# 2 10002 0.2554288
# 3 10003 0.4622928
# 4 10004 0.9400145
Now the processing:
library(dplyr)
library(tidyr)
mapply(function(fname, varname)
cbind.data.frame(Samples = varname,
read.table(fname, skip = 1, col.names = c("Identifier", "VALUE")),
stringsAsFactors = FALSE),
file_list, gsub("_.*", "", file_list), SIMPLIFY = FALSE) %>%
bind_rows() %>%
spread(Samples, VALUE)
# Identifier GSM2861 GSM8302 GSM9149 GSM9370
# 1 10001 0.9346722 0.9782264 0.6417455 0.6569923
# 2 10002 0.2554288 0.1174874 0.5190959 0.7050648
# 3 10003 0.4622928 0.4749971 0.7365883 0.4577418
# 4 10004 0.9400145 0.5603327 0.1346666 0.7191123
Hard to tell if this will work, since your example isn't quite reproducible, but here's how I'd tackle it.
First, read all of the data files into one large data frame, creating an extra column called "sample" which will hold your sample label.
library(plyr)
df <- ddply(clinical_data, .(Data.File), function(x) {
data.this <- read.table(x$Data.File, header=TRUE, sep="\t")
data.this$sample <- x$Samples
return(data.this)
})
Then use the tidyr::spread function to create a new column for each "sample" with the values in the "VALUE" column.
library(tidyr)
df <- spread(df, sample, VALUE)
I have a dataset with 500,000 entries. Each entry has a userId and a productId. I want to get all userIds corresponding to each distinct productId, but the list is so huge that none of the following methods works for me; they run very slowly. Is there any faster solution?
Using lapply (problem: it traverses the whole rpid list for each element of uniqPids):
orderedIndx <- lapply(uniqPids, function(x){
  which(rpid %in% x)
})
names(orderedIndx) <- uniqPids
# Looking for indices with each unique productIds
Using a for loop:
orderedIndx <- list()
for (j in 1:length(rpid)) {
  existing <- length(orderedIndx[rpid[j]])
  orderedIndx[rpid[j]][existing + 1] <- j
}
Sample Data:
ruid[1:10]
# [1] "a3sgxh7auhu8gw" "a1d87f6zcve5nk" "abxlmwjixxain"  "a395borc6fgvxv" "a1uqrsclf8gw1t" "adt0srk1mgoeu"
# [7] "a1sp2kvkfxxru1" "a3jrgqveqn31iq" "a1mzyo9tzk0bbi" "a21bt40vzccyt4"
rpid[1:10]
# [1] "b001e4kfg0" "b001e4kfg0" "b000lqoch0" "b000ua0qiq" "b006k2zz7k" "b006k2zz7k" "b006k2zz7k" "b006k2zz7k"
# [9] "b000e7l2r4" "b00171apva"
Output should be like:
b001e4kfg0 -> a3sgxh7auhu8gw, a1d87f6zcve5nk
b000lqoch0 -> abxlmwjixxain
b000ua0qiq -> a395borc6fgvxv
b006k2zz7k -> a1uqrsclf8gw1t, adt0srk1mgoeu, a1sp2kvkfxxru1, a3jrgqveqn31iq
b000e7l2r4 -> a1mzyo9tzk0bbi
b00171apva -> a21bt40vzccyt4
It seems perhaps you're just looking for split?
split(seq_along(rpid), rpid)
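If you want the user IDs themselves rather than their row indices, split the id vector directly; with the sample vectors from the question:
split(ruid, rpid)
# $b000e7l2r4
# [1] "a1mzyo9tzk0bbi"
#
# $b000lqoch0
# [1] "abxlmwjixxain"
#
# ... and so on, one element per distinct productId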
Not exactly sure what type of output you want, or how many rows you have in your dataset, but I'd suggest three versions and you can choose the one you like. The first version uses dplyr and character values for your variables; I expect this to be slow if you have millions of rows. The second version uses dplyr but factor variables; I expect this to be faster than the previous one. The third version uses data.table; I expect this to be equally fast as, or faster than, the second version.
library(dplyr)
ruid =
c("a3sgxh7auhu8gw", "a1d87f6zcve5nk", "abxlmwjixxain", "a395borc6fgvxv",
"a1uqrsclf8gw1t", "adt0srk1mgoeu", "a1sp2kvkfxxru1", "a3jrgqveqn31iq",
"a1mzyo9tzk0bbi", "a21bt40vzccyt4")
rpid =
c("b001e4kfg0", "b001e4kfg0", "b000lqoch0", "b000ua0qiq", "b006k2zz7k",
"b006k2zz7k", "b006k2zz7k", "b006k2zz7k", "b000e7l2r4", "b00171apva")
### using dplyr and character values
dt = data.frame(rpid, ruid, stringsAsFactors = F)
dt %>%
group_by(rpid) %>%
do(data.frame(list_ruids = paste(c(.$ruid), collapse=", "))) %>%
ungroup
# rpid list_ruids
# (chr) (chr)
# 1 b000e7l2r4 a1mzyo9tzk0bbi
# 2 b000lqoch0 abxlmwjixxain
# 3 b000ua0qiq a395borc6fgvxv
# 4 b00171apva a21bt40vzccyt4
# 5 b001e4kfg0 a3sgxh7auhu8gw, a1d87f6zcve5nk
# 6 b006k2zz7k a1uqrsclf8gw1t, adt0srk1mgoeu, a1sp2kvkfxxru1, a3jrgqveqn31iq
# ----------------------------------
### using dplyr and factor values
dt = data.frame(rpid, ruid, stringsAsFactors = T)
dt %>%
group_by(rpid) %>%
do(data.frame(list_ruids = paste(c(levels(dt$ruid)[.$ruid]), collapse=", "))) %>%
ungroup
# rpid list_ruids
# (fctr) (chr)
# 1 b000e7l2r4 a1mzyo9tzk0bbi
# 2 b000lqoch0 abxlmwjixxain
# 3 b000ua0qiq a395borc6fgvxv
# 4 b00171apva a21bt40vzccyt4
# 5 b001e4kfg0 a3sgxh7auhu8gw, a1d87f6zcve5nk
# 6 b006k2zz7k a1uqrsclf8gw1t, adt0srk1mgoeu, a1sp2kvkfxxru1, a3jrgqveqn31iq
# -------------------------------------
library(data.table)
### using data.table
dt = data.table(rpid, ruid)
dt[, list(list_ruids = paste(c(ruid), collapse=", ")), by = rpid]
# rpid list_ruids
# 1: b001e4kfg0 a3sgxh7auhu8gw, a1d87f6zcve5nk
# 2: b000lqoch0 abxlmwjixxain
# 3: b000ua0qiq a395borc6fgvxv
# 4: b006k2zz7k a1uqrsclf8gw1t, adt0srk1mgoeu, a1sp2kvkfxxru1, a3jrgqveqn31iq
# 5: b000e7l2r4 a1mzyo9tzk0bbi
# 6: b00171apva a21bt40vzccyt4
Do you have tidy data in a dataframe? Then you can do this.
library(dplyr)
df %>%
  select(productId, userId) %>%
  distinct()
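For instance, building such a frame from the rpid/ruid vectors in the question and de-duplicating the pairs (a sketch):
df <- data.frame(productId = rpid, userId = ruid, stringsAsFactors = FALSE)
distinct(df, productId, userId)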