I am using an ArrayExpress dataset to build a data frame that I can run in GenePattern.
My folder, GSE11000, contains a bunch of files whose names follow this pattern:
GSM123445_samples_table.txt
GSM129995_samples_table.txt
Inside each file, the table is in this pattern
Identifier VALUE
10001 0.12323
10002 0.11535
I have a data frame, clinical_data, that lists all the files I want, in this pattern:
Data.File Samples OS.event
1 GSM123445_samples_table.txt GSM123445 0
2 GSM129995_samples_table.txt GSM129995 0
3 GSM129999_samples_table.txt GSM129999 1
4 GSM130095_samples_table.txt GSM130095 1
I want to create a data frame that should look like this:
Identifier GSM123445 GSM129995 GSM129999 GSM130095
1 10001 0.12323 0.14523 0.22387 0.56233
2 10002 0.11535 0.39048 0.23437 -0.12323
3 10006 0.12323 0.35634 0.12237 -0.12889
4 10008 0.11535 0.23454 0.21227 0.90098
This is my code
library(dplyr)
setwd(.../GSE11000)
file_list <- clinical_data[, 1] # create a list that include Data.File
for (file in file_list){
if (!exists("dataset")){ # if dataset not exists, create one
dataset <- read.table(file, header=TRUE, sep="\t") #read txt file from folder
x <- unlist(strsplit(file, "_"))[1] # extract the GSMxxxxxx from the name of files
dataset <- rename(dataset, x = VALUE) # rename the column
}
else {
temp_dataset <- read.table(file, header=TRUE, sep="\t") # read file
x <- unlist(strsplit(file, "_"))[1]
temp_dataset <- rename(temp_dataset, x = VALUE)
dataset<-left_join(dataset, temp_dataset, "Reporter.Identifier")
rm(temp_dataset)
}
}
My outcome is this
Identifier x.x x.y x.x x.y
1 10001 0.12323 0.14523 0.22387 0.56233
2 10002 0.11535 0.39048 0.23437 -0.12323
3 10006 0.12323 0.35634 0.12237 -0.12889
4 10008 0.11535 0.23454 0.21227 0.90098
This is because the rename part failed to work.
Does anyone have an idea how I can solve this problem? And can anyone make my code more efficient?
If you can tell me how to use Bioconductor to work with this data, I will be grateful too.
Similar to @jdobres's answer, but using dplyr (and spread):
First, to create some sample data files:
set.seed(42)
for (fname in sprintf("GSM%s_samples_table.txt", sample(10000, size = 4))) {
write.table(data.frame(Identifier = 10001:10004, VALUE = runif(4)),
file = fname, row.names = FALSE)
}
file_list <- list.files(pattern = "GSM.*")
file_list
# [1] "GSM2861_samples_table.txt" "GSM8302_samples_table.txt"
# [3] "GSM9149_samples_table.txt" "GSM9370_samples_table.txt"
read.table(file_list[1], skip = 1, col.names = c("Identifier", "VALUE"))
# Identifier VALUE
# 1 10001 0.9346722
# 2 10002 0.2554288
# 3 10003 0.4622928
# 4 10004 0.9400145
Now the processing:
library(dplyr)
library(tidyr)
mapply(function(fname, varname)
cbind.data.frame(Samples = varname,
read.table(fname, skip = 1, col.names = c("Identifier", "VALUE")),
stringsAsFactors = FALSE),
file_list, gsub("_.*", "", file_list), SIMPLIFY = FALSE) %>%
bind_rows() %>%
spread(Samples, VALUE)
# Identifier GSM2861 GSM8302 GSM9149 GSM9370
# 1 10001 0.9346722 0.9782264 0.6417455 0.6569923
# 2 10002 0.2554288 0.1174874 0.5190959 0.7050648
# 3 10003 0.4622928 0.4749971 0.7365883 0.4577418
# 4 10004 0.9400145 0.5603327 0.1346666 0.7191123
Hard to tell if this will work, since your example isn't quite reproducible, but here's how I'd tackle it.
First, read all of the data files into one large data frame, creating an extra column called "sample" which will hold your sample label.
library(plyr)
df <- ddply(clinical_data, .(Data.File), function(x) {
data.this <- read.table(as.character(x$Data.File), header=TRUE, sep="\t") # as.character() guards against Data.File being a factor
data.this$sample <- x$Samples
return(data.this)
})
Then use the tidyr::spread function to create a new column for each "sample" with the values in the "VALUE" column.
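library(tidyr)
df$Data.File <- NULL # drop the grouping column that ddply adds, so rows are keyed by the identifier only
df <- spread(df, sample, VALUE)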
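On why the original loop produced columns named x: `rename(dataset, x = VALUE)` treats `x` as the literal new name rather than looking up the variable, and the join then adds the .x/.y suffixes. With dplyr the usual fix is `rename(dataset, !!x := VALUE)`. Below is a minimal base R sketch of the same idea without the `exists()` bookkeeping. It assumes tab-separated files with `Identifier` and `VALUE` columns as in the question; the temporary directory, the toy values, and the helper name `read_sample` are made up for illustration.

```r
# Assumption: two tiny demo files written to a temp dir stand in for the real GSM files.
demo_dir <- file.path(tempdir(), "gsm_demo")
dir.create(demo_dir, showWarnings = FALSE)
samples <- c("GSM123445", "GSM129995")
for (i in seq_along(samples)) {
  write.table(data.frame(Identifier = c(10001, 10002), VALUE = c(0.1, 0.2) * i),
              file.path(demo_dir, paste0(samples[i], "_samples_table.txt")),
              sep = "\t", row.names = FALSE, quote = FALSE)
}
clinical_data <- data.frame(Data.File = paste0(samples, "_samples_table.txt"),
                            Samples = samples, stringsAsFactors = FALSE)

# Hypothetical helper: read one file and name its VALUE column after the sample.
read_sample <- function(file, sample) {
  d <- read.table(file.path(demo_dir, file), header = TRUE, sep = "\t")
  names(d)[names(d) == "VALUE"] <- sample
  d
}

# One data frame per file, then merge them all on Identifier.
tables <- Map(read_sample, clinical_data$Data.File, clinical_data$Samples)
dataset <- Reduce(function(a, b) merge(a, b, by = "Identifier", all = TRUE), tables)
print(dataset)
```

Merging with `all = TRUE` keeps identifiers that are missing from some files; drop it if only shared identifiers are wanted.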
I'm importing several .csvs that are all two columns wide (they're output from a program) - the first column is wavelength and the second is absorbance, but I'm naming it by the file name to be combined later like from this old stack overflow answer (Combining csv files in R to different columns). The incoming .csvs don't have headers, and I'm aware that the way I'm naming them crops the first data points. I would like for the first column to not have any decimals and standardize all of the numbers to four digits - the code I've added works on its own but not in this block - and I would prefer to do this formatting all in one go. I run into errors with $ not being the right operator, but when I use [] I get errors about that too. The column I need to do this to is the first and it's named 'Wavelength' - which also gives me errors either because wavelength doesn't exist or it's nonnumeric. Any ideas?
This is what my script currently looks like:
for (file in file_list) {
f <- sub("(.*)\\.CSV", "\\1", file)
assign(f, read.csv(file = file))
assign(f, setNames(get(f), c(names(get(f))[0:0], "Wavelength")))
assign(f, setNames(get(f), c(names(get(f))[1:1], file)))
floor(f[Wavelength]) #the issues are here
sprintf("%04d", f$Wavelength) #and here
}
The data looks like this in the csv before it gets processed:
1 401.7664 0.1379457
2 403.8058 0.1390427
3 405.8452 0.1421666
4 407.8847 0.1463629
5 409.9241 0.1477264
I would like the output to be:
Wavelength (file name)
1 0401 0.1379457
2 0403 0.1390427
3 0405 0.1421666
4 0407 0.1463629
5 0409 0.1477264
And here's the dput that r2evans asked for:
structure(list(X3.997270e.002 = c(401.7664, 403.8058, 405.8452,
407.8847, 409.9241, 411.9635), X1.393858e.001 = c(0.1379457,
0.1390427, 0.1421666, 0.1463629, 0.1477264, 0.1476971)),
row.names = c(NA, 6L), class = "data.frame")
Thanks in advance!
6/24 Update:
When I assign the column name "Wavelength" it only gets added as a character, not as a real column name? When I dput/head the files once they go through (omitting the sprintf/floor functions) it only lists the file name (the second column). When I open the csvs in R studio the first column is properly labeled - and even further I'm able to combine all the csvs sorted by "Wavelength":
list_csvs <- mget(sub("(.*)\\.CSV", "\\1", file_list))
all_csvs <- Reduce(function(x, y) merge(x, y, all=T,
by=c("Wavelength")), list_csvs, accumulate=F)
Naturally I've thought about just formatting the column after this, but some of the decimals are off in the thousands place so I do need to format before I merge the csvs.
I've updated the code to use colnames outside of the read.csv:
for (file in file_list) {
f <- sub("(.*)\\.CSV", "\\1", file)
assign(f, read.csv(file = file,
header = FALSE,
row.names = NULL))
colnames(f) <- c("Wavelength", file)
print(summary(f))
print(names(f))
#floor("Wavelength") #I'm omitting this to see the console errors
#sprintf("%04.0f", f["Wavelength"]) #omitting this too
}
but I get the following error:
attempt to set 'colnames' on an object with less than two dimensions
Without the naming bit and without the sprintf/floor I get this back from the summary and names prompt for each file:
Length Class Mode
1 character character
NULL
When I try to call out the first column by f[1], f[[1]], f[,1], or f[[,1]] I get error messages about 'incorrect number of dimensions'. I can clearly see in the R environment that each data frame has a length of 2. I also double checked with .row_names_info(f) that the first column isn't being read as row names. What am I doing wrong?
I'm going to suggest a dplyr/tidyr pipe for this.
First, data-setup:
writeLines(
"401.7664,0.1379457
403.8058,0.1390427
405.8452,0.1421666
407.8847,0.1463629
409.9241,0.1477264", "sample1.csv")
file.copy("sample1.csv", "sample2.csv")
file_list <- normalizePath(list.files(pattern = ".*\\.csv$", full.names = TRUE), winslash = "/")
file_list
# [1] "C:/Users/r2/StackOverflow/13765634/sample1.csv"
# [2] "C:/Users/r2/StackOverflow/13765634/sample2.csv"
First, I'm going to suggest a slightly different format: not naming the column for the filename. I like this because I'm still going to preserve the filename with the data (as a category, so to speak), but it allows you to combine all of your data into one frame for more efficient processing:
library(dplyr)
library(purrr) # map*
library(tidyr) # pivot_wider
file_list %>%
set_names(.) %>%
# set_names(tools::file_path_sans_ext(basename(.))) %>%
map_dfr(~ read.csv(.x, header = FALSE, col.names = c("freq", "val")),
.id = "filename") %>%
mutate(freq = sprintf("%04.0f", freq))
# filename freq val
# 1 C:/Users/r2/StackOverflow/13765634/sample1.csv 0402 0.1379457
# 2 C:/Users/r2/StackOverflow/13765634/sample1.csv 0404 0.1390427
# 3 C:/Users/r2/StackOverflow/13765634/sample1.csv 0406 0.1421666
# 4 C:/Users/r2/StackOverflow/13765634/sample1.csv 0408 0.1463629
# 5 C:/Users/r2/StackOverflow/13765634/sample1.csv 0410 0.1477264
# 6 C:/Users/r2/StackOverflow/13765634/sample2.csv 0402 0.1379457
# 7 C:/Users/r2/StackOverflow/13765634/sample2.csv 0404 0.1390427
# 8 C:/Users/r2/StackOverflow/13765634/sample2.csv 0406 0.1421666
# 9 C:/Users/r2/StackOverflow/13765634/sample2.csv 0408 0.1463629
# 10 C:/Users/r2/StackOverflow/13765634/sample2.csv 0410 0.1477264
Options: if you prefer just the filename (no path) and are certain that there is no filename collision, then use set_names(basename(.)) instead. (This step is really necessary when using the filename as a column name anyway.) I'll also remove the file extension, since they're likely all .csv or similar.
file_list %>%
# set_names(.) %>%
set_names(tools::file_path_sans_ext(basename(.))) %>%
map_dfr(~ read.csv(.x, header = FALSE, col.names = c("freq", "val")),
.id = "filename") %>%
mutate(freq = sprintf("%04.0f", freq))
# filename freq val
# 1 sample1 0402 0.1379457
# 2 sample1 0404 0.1390427
# 3 sample1 0406 0.1421666
# 4 sample1 0408 0.1463629
# 5 sample1 0410 0.1477264
# 6 sample2 0402 0.1379457
# 7 sample2 0404 0.1390427
# 8 sample2 0406 0.1421666
# 9 sample2 0408 0.1463629
# 10 sample2 0410 0.1477264
(If you need to do something to each dataset at a time, then you should use %>% group_by(filename), not sure if that's relevant.)
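One caveat on the formatting: `%04.0f` rounds, so 401.7664 prints as 0402, whereas the question asks for 0401. A small sketch of the difference, using the wavelengths from the sample file (the variable names are just for illustration):

```r
wavelength <- c(401.7664, 403.8058, 405.8452, 407.8847, 409.9241)

# "%04.0f" rounds to the nearest whole number before zero-padding.
rounded <- sprintf("%04.0f", wavelength)

# Flooring first, then formatting as an integer, truncates the decimals instead.
floored <- sprintf("%04d", as.integer(floor(wavelength)))

print(rounded)
print(floored)
```

If truncation is what you want, swapping the `mutate` step for `mutate(freq = sprintf("%04d", as.integer(floor(freq))))` should give the 0401-style labels.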
If you really need the filename to be the column name of the value, then modify this slightly so that it preserves it as a list:
file_list %>%
set_names(tools::file_path_sans_ext(basename(.))) %>%
map(~ read.csv(.x, header = FALSE, col.names = c("freq", "val"))) %>%
map2(., names(.), ~ transmute(.x, freq = sprintf("%04.0f", freq), !!.y := val))
# $sample1
# freq sample1
# 1 0402 0.1379457
# 2 0404 0.1390427
# 3 0406 0.1421666
# 4 0408 0.1463629
# 5 0410 0.1477264
# $sample2
# freq sample2
# 1 0402 0.1379457
# 2 0404 0.1390427
# 3 0406 0.1421666
# 4 0408 0.1463629
# 5 0410 0.1477264
But I'm going to infer that ultimately you want to combine these column-wise, assuming there will be alignment in the freq column. (I can't think of another reason why you'd want the column name to be the filename.)
For that, try this, reverting to the first use of map_dfr, introducing pivot_wider:
file_list %>%
set_names(tools::file_path_sans_ext(basename(.))) %>%
map_dfr(~ read.csv(.x, header = FALSE, col.names = c("freq", "val")),
.id = "filename") %>%
mutate(freq = sprintf("%04.0f", freq)) %>%
pivot_wider(id_cols = freq, names_from = filename, values_from = val)
# # A tibble: 5 x 3
# freq sample1 sample2
# <chr> <dbl> <dbl>
# 1 0402 0.138 0.138
# 2 0404 0.139 0.139
# 3 0406 0.142 0.142
# 4 0408 0.146 0.146
# 5 0410 0.148 0.148
Notes (perhaps more of a soap-box):
Regarding your use of assign, I strongly discourage this behavior. Since the data is effectively all structured the same, I infer that you'll be doing the same thing to each of these files. In that case, it is much better to use one of the *apply functions on a list of data.frames. That is, instead of having to iterate over a list of variable names, get each one, do something, then reassign it ... it is often much easier (to program, to read, to maintain) to write dats <- lapply(dats, some_function) or dats2 <- lapply(dats, function(x) { ...; x; }).
Regarding the use of filename-as-column-name. Some tools (e.g., ggplot2) really benefit from having "long" data (i.e., one or more category columns such as filename, and one column for each type of data ... type is relative to your understanding of the data). You might benefit from reframing your thinking on working with this data.
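To make the note about assign concrete, here is a rough sketch of the list-based workflow. Inside the question's loop, `f` is only a character string holding the object's name, which is consistent with the "less than two dimensions" error from `colnames(f)`; keeping the data frames in a named list sidesteps that. The temporary directory, the two sample files, and the helper `read_one` are assumptions for illustration, and the wavelength is floored because that is what the question's desired output shows.

```r
# Assumption: two small header-less CSV files in a temp dir stand in for the real data.
demo_dir <- file.path(tempdir(), "csv_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("401.7664,0.1379457", "403.8058,0.1390427"), file.path(demo_dir, "sample1.csv"))
writeLines(c("401.7664,0.2379457", "403.8058,0.2390427"), file.path(demo_dir, "sample2.csv"))

file_list <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
names(file_list) <- tools::file_path_sans_ext(basename(file_list))

# Hypothetical helper: read one file, name the value column after the file,
# and format the wavelength as a zero-padded four-digit string.
read_one <- function(path, label) {
  d <- read.csv(path, header = FALSE, col.names = c("Wavelength", label))
  d$Wavelength <- sprintf("%04d", as.integer(floor(d$Wavelength)))
  d
}

# A named list of data frames instead of one global variable per file.
dats <- Map(read_one, file_list, names(file_list))

# Combine column-wise on the formatted wavelength.
all_csvs <- Reduce(function(x, y) merge(x, y, by = "Wavelength", all = TRUE), dats)
print(all_csvs)
```

Because everything lives in `dats`, any later per-file step is another `lapply(dats, ...)` call rather than a round of get/assign.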
I'm trying to extract the "title" and "id" of movies coming from json content from several pages. This is the piece of code I use for making the request:
themoviedb_ip_address <- "https://api.themoviedb.org"
themovidedb_discover_movie_url <- "/3/discover/movie"
pages <- c(1,2)
themovidedb_discover_movie_req <- paste(themoviedb_ip_address,
themovidedb_discover_movie_url,
"?",
api_key,"&sort_by=revenue.desc",
"&include_adult=false",
"&include_video=false",
"&page=",
"{pages}",
"&primary_release_year=2010",
"&with_genres=18",
sep = "")
movie_revenue_2010 <- str_glue(themovidedb_discover_movie_req) %>%
map(GET) %>%
map(content,as = "parsed") %>%
map(purrr::pluck, "results")
This gives me the following result: [output images not preserved]
When trying to extract the titles and ids with the following piece of code:
movie_revenue_2010 <- str_glue(themovidedb_discover_movie_req) %>%
map(GET) %>%
map(content,as = "parsed") %>%
map(purrr::pluck, "results") %>%
map_df(magrittr::extract, c("title", "id"))
I get the following error: Error: Argument 1 must have names
Please note that the following piece of code works correctly:
for(i in 1:2){
themovidedb_discover_movie_query_string <- paste("&sort_by=revenue.desc",
"&include_adult=false",
"&include_video=false",
"&page=",
i,
"&primary_release_year=2010",
"&with_genres=18",
sep = "")
#print(themovidedb_discover_movie_query_string)
movie_revenue_2010_req <-
httr::GET(paste(themoviedb_ip_address,
themovidedb_discover_movie_url,
"?",
api_key,
themovidedb_discover_movie_query_string,
sep = ""))
movie_revenue_2010_content <- httr::content(movie_revenue_2010_req,
as = "parsed")
movie_revenue_2010 <- purrr::pluck(movie_revenue_2010_content, "results")
movie_revenue_2010_tbl <- movie_revenue_2010_tbl %>%
bind_rows(map_df(movie_revenue_2010, extract, c("title", "id")))
}
But I can't use a for loop in my work.
A dput() of the content is available here:
movie_revenue_2010 <- str_glue(themovidedb_discover_movie_req) %>%
map(GET) %>%
map(content,as = "parsed") %>%
map(purrr::pluck, "results") %>%
dput()
Assuming your nested list is called temp, we can do
library(purrr)
map(temp, ~map_df(.x, `[`, c('title', 'id')))
#[[1]]
# A tibble: 20 x 2
# title id
# <chr> <int>
# 1 The Twilight Saga: Eclipse 24021
# 2 The King's Speech 45269
# 3 The Karate Kid 38575
# 4 Black Swan 44214
# 5 Robin Hood 20662
#...
#...
#[[2]]
# A tibble: 20 x 2
# title id
# <chr> <int>
# 1 The Other Woman 52505
# 2 Green Zone 22972
# 3 The Fighter 45317
# 4 Burlesque 42297
# 5 Letters to Juliet 37056
#...
Or if you want everything in one dataframe
map_df(temp, ~map_df(.x, `[`, c('title', 'id')))
# A tibble: 40 x 2
# title id
# <chr> <int>
# 1 The Twilight Saga: Eclipse 24021
# 2 The King's Speech 45269
# 3 The Karate Kid 38575
# 4 Black Swan 44214
# 5 Robin Hood 20662
# 6 Shutter Island 11324
# 7 Sex and the City 2 37786
# 8 True Grit 44264
# 9 The Social Network 37799
#10 The Sorcerer's Apprentice 27022
# … with 30 more rows
In base R, we can do
lapply(temp, function(x) do.call(rbind, lapply(x, `[`, c('title', 'id'))))
and
do.call(rbind, lapply(temp, function(x) do.call(rbind,lapply(x, `[`, c('title', 'id')))))
respectively for the same output as above.
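A likely reason for the original error, as far as I can tell: after plucking "results", each element of the outer list is a whole page (an unnamed list of movies), so extracting c("title", "id") has to happen one level further down, which is what the nested map_df does. Note also that rbind on plain lists gives a matrix of list cells rather than a data frame. If an ordinary data frame is wanted from base R, here is a sketch on a hand-made stand-in for temp (the toy titles and ids below are invented):

```r
# Assumption: a toy nested list with the same shape as the API result,
# i.e. a list of pages, each page a list of movies.
temp <- list(
  list(list(title = "A", id = 1L, extra = "x"),
       list(title = "B", id = 2L, extra = "y")),
  list(list(title = "C", id = 3L, extra = "z"))
)

# For each movie keep only title and id as a one-row data frame,
# stack the rows within a page, then stack the pages.
movies <- do.call(rbind, lapply(temp, function(page) {
  do.call(rbind, lapply(page, function(movie) {
    as.data.frame(movie[c("title", "id")], stringsAsFactors = FALSE)
  }))
}))
print(movies)
```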
I have used the walkscore API to generate output for a list of Lat and Longs
Reprex of dataset:
tibble::tribble(
~Lat, ~Long,
39.75454546, -82.63637088,
40.85117794, -81.47034464,
40.53956136, -74.33630685,
42.16066679, -71.21368025,
39.27048579, -119.5770782,
64.82534285, -147.6738774
)
My code:
library(walkscoreAPI)
library(rjson)
data = read.csv(file="geocode_finalcompiled.csv", header=TRUE, sep=",")
attach(data)
#create empty list
res = list()
# for loop through a file
for(i in 1:500){
res[i] = list(getWS(data$Long[i],data$Lat[i],"mykey"))
}
# show results
res
> res
[[1]]
$status
[1] 1
$walkscore
[1] 2
$description
[1] "Car-Dependent"
$updated
[1] "2019-03-28 21:43:37.670012"
$snappedLong
[1] -82.6365
$snappedLat
[1] 39.7545
As you can see the output is in json format. My objective is to make this into a dataframe where each value is displayed under each header and can be put into a csv.
I tried:
resformatted <- as.data.frame(res)
But got the below error:
Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) :
cannot coerce class ‘"WalkScore"’ to a data.frame
What can be done to fix this?
Going off the above approach:
library(dplyr)
library(tibble)
res %>%
sapply(unclass) %>%
as.data.frame() %>%
t() %>%
as.data.frame() %>%
lapply(unlist) %>%
as.data.frame(stringsAsFactors = FALSE) %>%
remove_rownames() -> df
Produces:
# status walkscore description updated snappedLong snappedLat
# 1 1 2 Car-Dependent 2019-03-28 21:43:37.670012 -82.6365 39.7545
# 2 1 4 Car-Dependent 2019-04-11 11:23:51.651955 -81.471 40.851
# 3 1 60 Somewhat Walkable 2019-02-25 01:05:08.918498 -74.337 40.539
# 4 1 44 Car-Dependent 2019-04-17 16:26:58.848496 -71.214 42.1605
# 5 1 16 Car-Dependent 2019-05-09 01:34:59.741290 -119.577 39.27
# 6 1 0 Car-Dependent 2019-07-22 19:27:50.170107 -147.6735 64.8255
And write to csv with:
write.csv(df, file = "dfwalk.csv")
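A shorter variant, as a sketch only: strip the class from each result, turn it into a one-row data frame, and stack the rows. It assumes every element of res has the same fields, each of length one, as in the printed output above. The two WalkScore objects below are hand-built stand-ins so the snippet runs without an API key.

```r
# Assumption: hand-built objects that mimic the structure printed in the question.
res <- list(
  structure(list(status = 1, walkscore = 2, description = "Car-Dependent",
                 snappedLong = -82.6365, snappedLat = 39.7545),
            class = "WalkScore"),
  structure(list(status = 1, walkscore = 60, description = "Somewhat Walkable",
                 snappedLong = -74.337, snappedLat = 40.539),
            class = "WalkScore")
)

# unclass() leaves a plain named list, which as.data.frame() can handle.
rows <- lapply(res, function(x) as.data.frame(unclass(x), stringsAsFactors = FALSE))
df <- do.call(rbind, rows)

# Written to a temp dir here; point this at your own path.
write.csv(df, file = file.path(tempdir(), "dfwalk.csv"), row.names = FALSE)
print(df)
```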
I have more than 500 files (df1) in a folder, and I want to create new files by merging each one with a reference table (nf1).
data[1] <- Composite.REF Call Confidence
SNP_A-2131660 2 0.0053
SNP_A-1967418 2 0.0075
SNP_A-1969580 2 0.0042
SNP_A-4263484 2 0.0052
nf1 <-
Composite.REF dbSNP.RS.ID Chromosome Physical.Position Allele.A Allele.B Gene region
SNP_A-2131660 rs4147951 2 66943738 A G ABCA8 intron
SNP_A-1967418 rs2022235 2 14326088 C T --- downstream
SNP_A-1969580 rs6425720 2 31709555 A G NKAIN1 intron
SNP_A-4263484 rs12997193 2 106584554 A C --- upstream
finalFile <-
Composite.REF dbSNP.RS.ID Chromosome Physical.Position Allele.A Allele.B Gene region data[1]
SNP_A-1969580 rs6425720 2 31709555 A G NKAIN1 intron 0.042
listFiles <- list.files(pattern = "data.txt$",recursive=T) # list all the files with extension data.txt
for (i in 1:length(listFiles)){
data<-read.table(file=paste(listFiles[i]), sep="\t", skip=1, header=T)
dataF <-data[data$Confidence < 0.05,] #add a filter
finalFile <- merge(dataF, nf1, by = "Composite.Element.REF") #merge 2 data based on common column
write.table(finalFile, gsub("data.txt", "data_new.txt" ,listFiles[i]), sep = "\t", row.names=F, quote=F) #save the output
}
This takes a lot of time to complete, as it loops through one sample at a time. I want to know if there is a more elegant way to do the job.
It's extremely difficult to answer this question without some data, but the plyr package would allow you to do something like this:
library(plyr)
data.main <- adply(listFiles, 1, read.table, sep="\t", skip=1, header=T) # load all files
data.main <- subset(data.main, Confidence < 0.05) # reduce data by cutoff value
data.main <- merge(data.main, nf1, by = 'Composite.Element.REF') # merge data sets
# write out all files
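d_ply(data.main, .(X1), function(x) { # adply() names its index column X1 by default
file.name <- sprintf('new data %s.txt', as.character(x$X1[1])) # %s, since the index is not guaranteed to be an integer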
write.table(x, file.name, sep = "\t", row.names=F, quote=F) #save the output
})
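For what it is worth, the run time for 500 files is likely dominated by reading and writing, so a tidier loop may not be much faster. A base R sketch that wraps the per-file work in a function and applies it to each path is below. The join column is named `Composite.Element.REF` as in the question's merge call (the sample tables show `Composite.REF`, so adjust to whatever the real header is); the temporary files, the toy SNP rows, and the helper `process_one` are invented for the demo.

```r
# Assumption: a toy reference table and two toy input files in a temp dir.
demo_dir <- file.path(tempdir(), "snp_demo")
dir.create(demo_dir, showWarnings = FALSE)
nf1 <- data.frame(Composite.Element.REF = c("SNP_A-1", "SNP_A-2"),
                  dbSNP.RS.ID = c("rs1", "rs2"), stringsAsFactors = FALSE)
for (k in 1:2) {
  writeLines(c("first line, skipped on read",
               "Composite.Element.REF\tCall\tConfidence",
               "SNP_A-1\t2\t0.0053",
               "SNP_A-2\t2\t0.2000"),
             file.path(demo_dir, sprintf("s%d.data.txt", k)))
}

# Hypothetical helper: filter one file, merge it with the reference, write the result.
process_one <- function(path, ref) {
  d <- read.table(path, sep = "\t", skip = 1, header = TRUE, stringsAsFactors = FALSE)
  d <- d[d$Confidence < 0.05, ]
  out <- merge(d, ref, by = "Composite.Element.REF")
  out_path <- sub("data\\.txt$", "data_new.txt", path)
  write.table(out, out_path, sep = "\t", row.names = FALSE, quote = FALSE)
  out_path
}

files <- list.files(demo_dir, pattern = "data\\.txt$", full.names = TRUE)
written <- vapply(files, process_one, character(1), ref = nf1)
print(unname(written))
```

Unlike stacking all files into one big data frame first, this keeps only one file in memory at a time.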
Good day,
I will present two [likely] very puny problems for your excellent review.
Problem #1
I have a relatively tidy df (dat) with dim 10299 x 563. The 563 variables common to both datasets [that created] dat are 'subject' (numeric), 'label' (numeric), 3:563 (variable names from a text file). Observations 1:2947 are from a 'test' dataset whereas observations 2948:10299 are from a 'training' dataset.
I'd like to insert a column (header = 'type') into dat in which rows 1:2947 hold the string test and rows 2948:10299 hold the string train, so that I can later group on dataset or use similar aggregate functions in dplyr/tidyr.
I created a test df (testdf = 1:10299; dim(testdf) = 10299 x 1) and then:
testdat[1:2947 , "type"] <- c("test")
testdat[2948:10299, "type"] <- c("train")
> head(ds, 2);tail(ds, 2)
X1.10299 type
1 1 test
2 2 test
X1.10299 type
10298 10298 train
10299 10299 train
So I really don't like that there is now a column of X1.10299.
Questions:
Is there a better and more expedient way to create a column that has what I'm looking for based upon my use case above?
What is a good way to actually insert that column into 'dat' so that I can use it later for grouping with dplyr?
Problem #2
The way I arrived at my [nearly] tidy df (dat) from above was to take two dfs (test and train) with dimensions 2947 x 563 and 7352 x 563, respectively, and rbind them together.
I confirm that all of my variable names are present after the binding effort by something like this:
test.names <- names(test)
train.names <- names(train)
identical(test.names, train.names)
> TRUE
What is interesting and of primary concern is that if I try to use the bind_rows function from 'dplyr' to perform the same binding exercise:
dat <- bind_rows(test, train)
It returns a data frame that apparently keeps all of my observations (10299 rows), but now my variable count is reduced from 563 to 470!
Question:
Does anyone know why my variables are being chopped?
Is this the best way to combine two dfs of the same structure for later slicing/dicing with dplyr/tidyr?
Thank you for your time and consideration of these matters.
Sample test/train dfs for review (the left most numeric are df indices):
test df
test[1:10, 1:5]
subject labels tBodyAcc-mean()-X tBodyAcc-mean()-Y tBodyAcc-mean()-Z
1 2 5 0.2571778 -0.02328523 -0.01465376
2 2 5 0.2860267 -0.01316336 -0.11908252
3 2 5 0.2754848 -0.02605042 -0.11815167
4 2 5 0.2702982 -0.03261387 -0.11752018
5 2 5 0.2748330 -0.02784779 -0.12952716
6 2 5 0.2792199 -0.01862040 -0.11390197
7 2 5 0.2797459 -0.01827103 -0.10399988
8 2 5 0.2746005 -0.02503513 -0.11683085
9 2 5 0.2725287 -0.02095401 -0.11447249
10 2 5 0.2757457 -0.01037199 -0.09977589
train df
train[1:10, 1:5]
subject label tBodyAcc-mean()-X tBodyAcc-mean()-Y tBodyAcc-mean()-Z
1 1 5 0.2885845 -0.020294171 -0.1329051
2 1 5 0.2784188 -0.016410568 -0.1235202
3 1 5 0.2796531 -0.019467156 -0.1134617
4 1 5 0.2791739 -0.026200646 -0.1232826
5 1 5 0.2766288 -0.016569655 -0.1153619
6 1 5 0.2771988 -0.010097850 -0.1051373
7 1 5 0.2794539 -0.019640776 -0.1100221
8 1 5 0.2774325 -0.030488303 -0.1253604
9 1 5 0.2772934 -0.021750698 -0.1207508
10 1 5 0.2805857 -0.009960298 -0.1060652
Actual Code (ignore the function calls/I'm doing most of the testing via console).
The data set I'm using with this code: http://archive.ics.uci.edu/ml/machine-learning-databases/00240/
run_analysis <- function () {
#Vars available for use throughout the function that should be preserved
vars <- read.table("features.txt", header = FALSE, sep = "")
lookup_table <- data.frame(activitynum = c(1,2,3,4,5,6),
activity_label = c("walking", "walking_up",
"walking_down", "sitting",
"standing", "laying"))
test <- test_read_process(vars, lookup_table)
train <- train_read_process(vars, lookup_table)
}
test_read_process <- function(vars, lookup_table) {
#read in the three documents for cbinding later
test.sub <- read.table("test/subject_test.txt", header = FALSE)
test.labels <- read.table("test/y_test.txt", header = FALSE)
test.obs <- read.table("test/X_test.txt", header = FALSE, sep = "")
#cbind the cols together and set remaining colNames to var names in vars
test.dat <- cbind(test.sub, test.labels, test.obs)
colnames(test.dat) <- c("subject", "labels", as.character(vars[,2]))
#Use lookup_table to set the "test_labels" string values that correspond
#to their integer IDs
#test.lookup <- merge(test, lookup_table, by.x = "labels",
# by.y ="activitynum", all.x = T)
#Remove temporary symbols from globalEnv/memory
rm(test.sub, test.labels, test.obs)
#return
return(test.dat)
}
train_read_process <- function(vars, lookup_table) {
#read in the three documents for cbinding
train.sub <- read.table("train/subject_train.txt", header = FALSE)
train.labels <- read.table("train/y_train.txt", header = FALSE)
train.obs <- read.table("train/X_train.txt", header = FALSE, sep = "")
#cbind the cols together and set remaining colNames to var names in vars
train.dat <- cbind(train.sub, train.labels, train.obs)
colnames(train.dat) <- c("subject", "label", as.character(vars[,2]))
#Clean up temporary symbols from globalEnv/memory
rm(train.sub, train.labels, train.obs, vars)
return(train.dat)
}
The problem that you're facing stems from the fact that you have duplicated names in the variable list that you're using to create your data frame objects. If you ensure that the column names are unique and shared between the objects the code will run. I've included a fully working example based on the code you used above (with fixes and various edits noted in the comments):
vars <- read.table(file="features.txt", header=F, stringsAsFactors=F)
## FRS: This is the source of original problem:
duplicated(vars[,2])
vars[317:340,2]
duplicated(vars[317:340,2])
vars[396:419,2]
## FRS: I edited the following to both account for your data and variable
## issues:
test_read_process <- function() {
#read in the three documents for cbinding later
test.sub <- read.table("test/subject_test.txt", header = FALSE)
test.labels <- read.table("test/y_test.txt", header = FALSE)
test.obs <- read.table("test/X_test.txt", header = FALSE, sep = "")
#cbind the cols together and set remaining colNames to var names in vars
test.dat <- cbind(test.sub, test.labels, test.obs)
#colnames(test.dat) <- c("subject", "labels", as.character(vars[,2]))
colnames(test.dat) <- c("subject", "labels", paste0("V", 1:nrow(vars)))
return(test.dat)
}
train_read_process <- function() {
#read in the three documents for cbinding
train.sub <- read.table("train/subject_train.txt", header = FALSE)
train.labels <- read.table("train/y_train.txt", header = FALSE)
train.obs <- read.table("train/X_train.txt", header = FALSE, sep = "")
#cbind the cols together and set remaining colNames to var names in vars
train.dat <- cbind(train.sub, train.labels, train.obs)
#colnames(train.dat) <- c("subject", "labels", as.character(vars[,2]))
colnames(train.dat) <- c("subject", "labels", paste0("V", 1:nrow(vars)))
return(train.dat)
}
test_df <- test_read_process()
train_df <- train_read_process()
identical(names(test_df), names(train_df))
library("dplyr")
## FRS: These could be piped together but I've kept them separate for clarity:
train_df %>%
mutate(test="train") ->
train_df
test_df %>%
mutate(test="test") ->
test_df
test_df %>%
bind_rows(train_df) ->
out_df
head(out_df)
out_df
## FRS: You can set your column names to those of the original
## variable list but you still have duplicates to deal with:
names(out_df) <- c("subject", "labels", as.character(vars[,2]), "test")
duplicated(names(out_df))
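If you would rather keep the descriptive feature names than switch to V1, V2, ..., one option is to make them unique first and then let bind_rows add the test/train label itself. This is a sketch on toy data, not the real files: the three feature names, the helper `toy`, and the column name `type` are illustrative assumptions.

```r
library(dplyr)

# Assumption: a short stand-in for vars[, 2] that contains one duplicated name.
feature_names <- c("fBodyAcc-bandsEnergy()-1,8",
                   "fBodyAcc-bandsEnergy()-1,8",
                   "tBodyAcc-mean()-X")

# make.unique() appends _1, _2, ... to repeated names and leaves the rest alone.
unique_names <- make.unique(feature_names, sep = "_")

# Hypothetical helper building a one-row data frame with the cleaned names.
toy <- function(subject) {
  df <- data.frame(subject, 5, 0.1, 0.2, 0.3)
  names(df) <- c("subject", "labels", unique_names)
  df
}
test_df  <- toy(2)
train_df <- toy(1)

# A named list plus .id gives the origin column in the same step as the bind.
out_df <- bind_rows(list(test = test_df, train = train_df), .id = "type")
print(names(out_df))
```

The resulting `type` column can go straight into `group_by(type)`.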