Combine, Order, Dedup over Multiple Files in R

I have a large number of CSV files that look like this:
var val1 val2
a 2 1
b 2 2
c 3 3
d 9 2
e 1 1
I would like to:
1. Read them in
2. Take the top 3 from each CSV
3. Make a list of the variable names only (3 x number of files)
4. Keep only the unique names on the list
I think I have managed to get to point 3 by doing this:
csvList <- list.files(path = "mypath", pattern = "*.csv", full.names = T)
bla <- lapply(lapply(csvList, read.csv), function(x) x[order(x$val1, decreasing=T)[1:3], ])
lapply(bla,"[", , 1, drop=FALSE)
Now, I have a list of the top 3 variables in each CSV. However, I don't know how to convert this list to a string and keep only the unique values.
Any help is welcome.
Thank you!

The issue is that you extract the first column of each element of bla with drop=FALSE. That keeps each result as a one-column data frame instead of coercing it to its lowest dimension, which is a vector. Use drop=TRUE instead, then unlist followed by unique, as @Frank suggests:
unique(unlist(lapply(bla,"[", , 1, drop=TRUE)))
As you know, drop=TRUE is the default, so you don't even have to include it.
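To see the whole chain end to end without touching the file system, here is a minimal sketch. The two data frames csv_a and csv_b are made-up stand-ins for files read with read.csv, and their values are assumptions chosen for illustration.

```r
# Two small data frames standing in for the CSV files (values are made up).
csv_a <- data.frame(var = c("a", "b", "c", "d", "e"),
                    val1 = c(2, 4, 3, 9, 1),
                    stringsAsFactors = FALSE)
csv_b <- data.frame(var = c("a", "d", "f", "g", "h"),
                    val1 = c(5, 8, 1, 7, 2),
                    stringsAsFactors = FALSE)
frames <- list(csv_a, csv_b)

# Top 3 rows of each data frame by val1.
top3 <- lapply(frames, function(x) x[order(x$val1, decreasing = TRUE)[1:3], ])

# First column of each as a plain vector (drop defaults to TRUE),
# then flattened into one vector and deduplicated.
top_vars <- unique(unlist(lapply(top3, function(x) x[, 1])))
print(top_vars)
```

Here "d" is in the top 3 of both data frames, so it appears only once in top_vars.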
Update to new requirements in comments.
To keep the first two columns var and val1 and remove duplicates in var (keep only the unique vars), do the following:
## unlist each column in turn and form a data frame
res <- data.frame(lapply(c(1,2), function(x) unlist(lapply(bla,"[", , x))))
colnames(res) <- c("var","val1") ## restore the two column names
## remove duplicates
res <- res[!duplicated(res[,1]),]
Note that this will only keep the first row for each unique var. This is the definition of removing duplicates here.
Hope this helps.
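A small illustration of the "first row wins" behaviour of duplicated, using a made-up res data frame (the values are assumptions, not taken from the question):

```r
# Made-up combined result in which "d" occurs twice.
res <- data.frame(var = c("d", "b", "d", "a"),
                  val1 = c(9, 4, 8, 5),
                  stringsAsFactors = FALSE)

# duplicated() flags the second and later occurrences of a value,
# so negating it keeps the first row for each var.
is_dup <- duplicated(res$var)
deduped <- res[!is_dup, ]
print(deduped)
```

The row with var "d" and val1 8 is dropped because an earlier "d" row exists. If a different row should survive, sort res first so that the preferred row comes first.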

Related

R: How do you subset all data-frames within a list?

I have a list of data-frames called WaFramesCosts. I want to simply subset it to show specific columns so that I can then export them. I have tried:
for (i in names(WaFramesCosts)) {
WaFramesCosts[[i]][,c("Cost_Center","Domestic_Anytime_Min_Used","Department",
"Domestic_Anytime_Min_Used")]
}
but it returns the error of
Error in `[.data.frame`(WaFramesCosts[[i]], , c("Cost_Center", "Department", :
undefined columns selected
I also tried:
for (i in seq_along(WaFramesCosts)){
WaFramesCosts[[i]][ , -which(names(WaFramesCosts[[i]]) %in% c("Cost_Center","Domestic_Anytime_Min_Used","Department",
"Domestic_Anytime_Min_Used"))]
but I get the same error. Can anyone see what I am doing wrong?
Side Note: For reference, I used this:
for (i in seq_along(WaFramesCosts)) {
t <- WaFramesCosts[[i]][ , grepl( "Domestic" , names( WaFramesCosts[[i]] ) )]
q <- subset(WaFramesCosts[[i]], select = c("Cost_Center","Domestic_Anytime_Min_Used","Department","Domestic_Anytime_Min_Used"))
WaFramesCosts[[i]] <- merge(q,t)
}
while attempting the same goal with a different approach and seemed to get closer.
Welcome back, Kootseeahknee. You are still incorrectly assuming that the last command of a for loop is implicitly returned at the end. If you want that behavior, perhaps you want lapply:
myoutput <- lapply(names(WaFramesCosts), function(i) {
WaFramesCosts[[i]][,c("Cost_Center","Domestic_Anytime_Min_Used","Department","Domestic_Anytime_Min_Used")]
})
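The difference between the two constructs can be checked on a toy list. This is only a sketch; frames and its columns x and y are hypothetical names, not the asker's data.

```r
# Hypothetical list of data frames.
frames <- list(a = data.frame(x = 1:2, y = 3:4),
               b = data.frame(x = 5:6, y = 7:8))

# Inside a for loop the subset is computed and then thrown away;
# nothing is assigned, so frames is unchanged afterwards.
for (nm in names(frames)) {
  frames[[nm]][, "x", drop = FALSE]
}
ncol_before <- ncol(frames$a)

# lapply collects the value of each function call into a new list.
subsets <- lapply(frames, function(d) d[, "x", drop = FALSE])
ncol_after <- ncol(subsets$a)
```

Iterating over the list itself, as here, keeps the element names on the result. Iterating over names(frames) works too, but the result is then an unnamed list unless you set the names yourself.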
The undefined columns selected error tells me that your assumptions of the datasets are not correct: at least one is missing at least one of the columns. From your previous question (How to do a complex edit of columns of all data frames in a list?), I'm inferring that you want columns that match, not assuming that it is in everything. From that, you could/should be using grep or some variant:
myoutput <- lapply(names(WaFramesCosts), function(i) {
WaFramesCosts[[i]][,grep("(Cost_Center|Domestic_Anytime_Min_Used|Department)",
colnames(WaFramesCosts[[i]])),drop=FALSE]
})
This will match column names that contain any of those strings. You can be a lot more precise by ensuring whole strings or start/end matches occur by using regular expressions. For instance, changing from (Cost|Dom) (anything that contains "Cost" or "Dom") to (^Cost|Dom) means anything that starts with "Cost" or contains "Dom"; similarly, (Cost|ment$) matches anything that contains "Cost" or ends with "ment". If, however, you always want exact matches and just need those that exist, then something like this will work:
myoutput <- lapply(names(WaFramesCosts), function(i) {
WaFramesCosts[[i]][,intersect(c("Cost_Center","Domestic_Anytime_Min_Used","Department"),
colnames(WaFramesCosts[[i]])),drop=FALSE]
})
Note the drop=FALSE in the last two examples: mtcars[,2] returns a vector, while mtcars[,2,drop=FALSE] returns a data.frame with one column. As a matter of defensive programming, if there is any chance that your filtering returns a single column, append ,drop=FALSE to your bracket subsetting so the result is not inadvertently converted to a vector.
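To make the pattern variants and the exact-match variant concrete, here is a short sketch. The column names in cols and the two data frames in frames are invented for illustration.

```r
# Hypothetical column names.
cols <- c("Cost_Center", "Total_Cost", "Domestic_Min", "Department")

# Contains "Cost" or "Dom" anywhere.
any_match <- grep("(Cost|Dom)", cols, value = TRUE)
# Starts with "Cost", or contains "Dom".
start_match <- grep("(^Cost|Dom)", cols, value = TRUE)
# Contains "Cost", or ends with "ment".
end_match <- grep("(Cost|ment$)", cols, value = TRUE)

# Exact-name selection over a list in which one data frame lacks a column.
frames <- list(
  full = data.frame(Cost_Center = 1:2, Department = c("x", "y"), Other = 3:4),
  partial = data.frame(Cost_Center = 5:6, Other = 7:8)
)
wanted <- c("Cost_Center", "Department")
picked <- lapply(frames, function(d) {
  # intersect() keeps only the wanted names that exist in this data frame;
  # drop = FALSE keeps a data frame even when a single column is left.
  d[, intersect(wanted, colnames(d)), drop = FALSE]
})
```

With intersect, the data frame that lacks Department no longer raises "undefined columns selected"; it simply comes back with the one column it has.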
Based on your description, here is an example that uses the dplyr library to combine a list of data frames and keep a given set of columns. bind_rows does not require all data frames to have identical columns. (Providing your data as a reproducible example would be better.)
# test data
df1 = read.table(text = "
c1 c2 c3
a 1 101
b 2 102
", header = TRUE, stringsAsFactors = FALSE)
df2 = read.table(text = "
c1 c2 c3
w 11 201
x 12 202
", header = TRUE, stringsAsFactors = FALSE)
# dfs is a list of data frames
dfs <- list(df1, df2)
# use dplyr::bind_rows
library(dplyr)
cols <- c("c1", "c3")
result <- bind_rows(dfs)[cols]
result
# c1 c3
# 1 a 101
# 2 b 102
# 3 w 201
# 4 x 202

Renaming Column Headers

I want to map the FactorName in the dataframe FName to the column header names of Stack. I.e., Factor1 in Stack is actually named Value, Factor2 is Leverage, etc. I have a large dataset, so manually renaming is not an option.
Stack <- data.frame(rowid=1:3, Factor1=2:4, Factor2=3:5, Factor3=4:6)
FName <- data.frame(FactorID=c("Factor1","Factor2","Factor3"), FactorName=c("Value","Leverage","Growth"))
Thanks.
How about this using match:
Stack <- data.frame(rowid=1:3, Factor1=2:4, Factor2=3:5, Factor3=4:6)
FName <- data.frame(
FactorID=c("Factor1","Factor2","Factor3"),
FactorName=c("Value","Leverage","Growth"))
# Matching entries from FName
colnames(Stack) <- ifelse(
!is.na(FName$FactorName[match(colnames(Stack), FName$FactorID)]),
as.character(FName$FactorName[match(colnames(Stack), FName$FactorID)]),
colnames(Stack));
Stack;
# rowid Value Leverage Growth
#1 1 2 3 4
#2 2 3 4 5
#3 3 4 5 6
Explanation: We match column names of Stack and entries from FName$FactorID. If there is a match, replace with FName$FactorName, else keep the original column name.
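It may help to look at what match returns on its own. A minimal sketch using the column names from the question; ids is just a plain character copy of FName$FactorID:

```r
Stack <- data.frame(rowid = 1:3, Factor1 = 2:4, Factor2 = 3:5, Factor3 = 4:6)
ids <- c("Factor1", "Factor2", "Factor3")

# For each column name, the position of its first match in ids,
# or NA when the name does not occur there (as for rowid).
idx <- match(colnames(Stack), ids)
print(idx)
```

The NA for rowid is why the renaming code needs a fallback to the original column name.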
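If the factor names are already at hand in the same order as the columns, you can assign them directly:
colnames(Stack)[-1] <- as.character(FName$FactorName) # assumes the rows of FName follow the column order of Stack
Another approach using match, but with indexing instead of ifelse: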
# Get indices of matches
m <- match(names(Stack), FName$FactorID)
# replace names where a match is found.
names(Stack)[!is.na(m)] <- as.character(FName$FactorName[m[!is.na(m)]])

What's the best way to add a specific string to all column names in a dataframe in R?

I am trying to train a model on data that was converted from a document term matrix to a dataframe. There are separate fields for the positive and negative comments, so I want to add a string to the column names to serve as a "tag" that differentiates the same word coming from the different fields. For example, the word hello can appear in both the positive and negative comment fields (and is thus represented as a column in my dataframe), so in my model I want to differentiate these by making the column names positive_hello and negative_hello.
I am looking for a way to rename columns in such a way that a specific string will be appended to all columns in the dataframe. Say, for mtcars, I want to rename all of the columns to have "_sample" at the end, so that the column names would become mpg_sample, cyl_sample, disp_sample and so on, which were originally mpg, cyl, and disp.
I'm considering using sapply or lapply, but I haven't made any progress with it. Any help would be greatly appreciated.
Use colnames and paste0 functions:
df = data.frame(x = 1:2, y = 2:1)
colnames(df)
[1] "x" "y"
colnames(df) <- paste0('tag_', colnames(df))
colnames(df)
[1] "tag_x" "tag_y"
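Applied to the cases described in the question, the same idea looks roughly like this. The pos and neg data frames are invented stand-ins for the two document term matrices.

```r
# Suffix every column of mtcars with "_sample".
df <- mtcars
colnames(df) <- paste0(colnames(df), "_sample")
first_three <- head(colnames(df), 3)

# Hypothetical term counts from the positive and negative comment fields.
pos <- data.frame(hello = 1:2, great = 0:1)
neg <- data.frame(hello = 3:4, awful = 1:0)
colnames(pos) <- paste0("positive_", colnames(pos))
colnames(neg) <- paste0("negative_", colnames(neg))

# After tagging, the two "hello" columns no longer clash when combined.
combined <- cbind(pos, neg)
print(colnames(combined))
```

paste0 is vectorised over the column names, so no sapply or lapply is needed.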
If you want to prefix each item in a column with a string, you can use paste():
# Generate sample data
df <- data.frame(good=letters, bad=LETTERS)
# Use the paste() function to append the same word to each item in a column
df$good2 <- paste('positive', df$good, sep='_')
df$bad2 <- paste('negative', df$bad, sep='_')
# Look at the results
head(df)
good bad good2 bad2
1 a A positive_a negative_A
2 b B positive_b negative_B
3 c C positive_c negative_C
4 d D positive_d negative_D
5 e E positive_e negative_E
6 f F positive_f negative_F
Edit:
Looks like I misunderstood the question. But you can rename columns in a similar way:
colnames(df) <- paste(colnames(df), 'sample', sep='_')
colnames(df)
[1] "good_sample" "bad_sample" "good2_sample" "bad2_sample"
Or to rename one specific column (column one, in this case):
colnames(df)[1] <- paste('prefix', colnames(df)[1], sep='_')
colnames(df)
[1] "prefix_good_sample" "bad_sample" "good2_sample" "bad2_sample"
You can use setnames from the data.table package; it renames the columns in place, without creating a copy of your data.
library(data.table)
df <- data.frame(a=c(1,2),b=c(3,4))
# a b
# 1 1 3
# 2 2 4
setnames(df,paste0(names(df),"_tag"))
print(df)
# a_tag b_tag
# 1 1 3
# 2 2 4

extracting variable from file names in R

I have files that contain multiple rows. I want to add two new columns that I create by extracting variables from the filename and combining them with the existing columns.
For example, I have a bunch of files that are named something like this:
file1[1000,1001].txt
file1[2000,1001].txt
Between the [] there are always 2 numbers separated by a comma.
The file itself has multiple columns, for example column1 and column2.
For each file, I want to extract the 2 values in the file name and use them as variables to make 2 new columns that use those values to modify the existing ones.
for example
file1[1000,2000]
the file contains two columns
column1 column2
1 2
2 4
I want at the end to add the first file name value to column 1 to create column3 and add the second file name value to column 2 to create column 4, ending up with something like this
column1 column2 column3 column4
1 2 1001 2002
2 4 1002 2004
Thanks for the help. I am almost there, with just a few more issues.
The original file has 2 columns, "X_Parameter" and "Y_Parameter"; the file name is "test(64084,4224).txt".
Your code works great at extracting the two values, V1 "64084" and V2 "4224", from the file name. I then add these values to the original data set, which yields 4 columns: "X_Parameter", "Y_Parameter", "V1", "V2".
setwd("~/Desktop/txt/")
txt_names = list.files(pattern = ".txt")
for (i in 1:length(txt_names)){assign(txt_names[i], read.delim(txt_names[i]))
DS1 <- read.delim(file = txt_names[i], header = TRUE, stringsAsFactors = TRUE)
require(stringr)
remove_text <- str_extract(txt_names, pattern = "\\[[0-9,0-9]+\\]")
step1 <- gsub("(\\[)", "", remove_text)
step2 <- gsub("(\\])", "", step1)
DS2<-as.data.frame(do.call("rbind", (str_split(step2, ","))))
DS1$V1<-DS2$V1
DS1$V2<-DS2$V2
My issue arises when trying to sum "X_Parameter" and "V1" to make "absoluteX", and "Y_Parameter" and "V2" to make "absoluteY", for each row.
Below are the two ways I have tried, with the errors:
DS1$absoluteX<-DS1$X_Parameter+DS1$V1
error
In Ops.factor(DS1$X_Parameter, DS1$V1) : ‘+’ not meaningful for factors
other try was
DS1$absoluteX<-rowSums(DS1[,c("X_Parameter","V1")])
error
Error in rowSums(DS1[, c("X_Parameter", "V1")]) : 'x' must be numeric
I have tried using
as.numeric(DS1$V1)
that causes all values to become 1
Any thoughts? Thanks
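The "all values become 1" result described above is what as.numeric gives for a factor: it returns the internal level codes, not the numbers shown as labels. A minimal sketch, assuming V1 ended up as a factor with a single level (v1 here is a made-up stand-in for DS1$V1):

```r
# A factor whose only level is the label "64084".
v1 <- factor(c("64084", "64084", "64084"))

# as.numeric() on a factor returns the integer level codes.
codes <- as.numeric(v1)

# Converting to character first recovers the numbers behind the labels.
values <- as.numeric(as.character(v1))
print(codes)
print(values)
```

Once the column is numeric, DS1$X_Parameter + DS1$V1 should work, provided X_Parameter is itself numeric rather than a factor.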
You can extract the numbers from a vector of file names as follows (not sure it is the shortest possible code, but it seems to work):
fnams<-c("file1[1000,2000].txt","file1[1500,2500].txt")
opsqbr<-regexpr("\\[",fnams)
comm<-regexpr(",",fnams)
clsqbr<-regexpr("\\]",fnams)
reslt<-data.frame(col1=as.numeric(substring(fnams,opsqbr+1,comm-1)),
col2=as.numeric(substring(fnams,comm+1,clsqbr-1)))
reslt
Which yields
col1 col2
1 1000 2000
2 1500 2500
Once you have this data frame, it is easy to sequentially read the files and do the addition.
## set path to wherever your files are
setwd("path")
## make a vector with names of your files
txt_names <- list.files(pattern = "\\.txt$") # pattern is a regular expression; this matches names ending in .txt
## read your files in
for (i in seq_along(txt_names)) assign(txt_names[i], read.csv(txt_names[i], sep = "whatever your separator is"))
## for now I'm making a dummy vector and data frame
txt_names <- c("[1000,2000]")
ds1 <- data.frame(column1 = c(1,2), column2 = c(2,4))
## grab the text you require from the file names
require(stringr)
remove_text <- str_extract(txt_names, pattern = "\\[[0-9,0-9]+\\]")
step1 <- gsub("(\\[)", "", remove_text)
step2 <- gsub("(\\])", "", step1)
## step2 should look like this
## step2
## [1] "1000,2000"
## split each string and convert to data frame with two columns
ds2 <- as.data.frame(do.call("rbind", (str_split(step2, ","))))
## cbind with the file
df <- cbind(ds1, ds2)
## coerce factor columns to numeric
df$V1 <- as.numeric(as.character(df$V1))
df$V2 <- as.numeric(as.character(df$V2))
## perform the operation to change the columns
df$V1 <- df$column1 + df$V1
df$V2 <- df$column2 + df$V2
Now you have a data.frame with two columns, each containing the file name parts you need. Just repeat them (with rep) as many times as each of your data.frames has rows, and cbind.
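The rep and cbind step can be sketched as follows for a single file. The file name and the contents of ds1 are the example values from the question; in real use ds1 would come from reading the file, and the names fname, inside, nums and out are only illustrative.

```r
# Example file name and a stand-in for its contents.
fname <- "file1[1000,2000].txt"
ds1 <- data.frame(column1 = c(1, 2), column2 = c(2, 4))

# Pull out the text between the brackets, split on the comma, convert to numbers.
inside <- sub(".*\\[(.*)\\].*", "\\1", fname)
nums <- as.numeric(strsplit(inside, ",", fixed = TRUE)[[1]])

# Repeat each value once per row of the data frame and bind as new columns.
n <- nrow(ds1)
out <- cbind(ds1, V1 = rep(nums[1], times = n), V2 = rep(nums[2], times = n))

# Add the file name values to the existing columns.
out$column3 <- out$column1 + out$V1
out$column4 <- out$column2 + out$V2
print(out)
```

Because nums is numeric from the start, no factor conversion is needed here. For many files, the same steps can be wrapped in a function and applied over the vector of file names.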

reading a table in R?

I have a txt file with the following structure:
NAME DATA1 DATA2
a 10 1,2,3
b 6 8,9
c 20 5,6,7 ,8
The first line represents the header, and the data is separated by tabs. I need to put the elements of DATA1 in a list or vector in a way that I can traverse the elements one by one.
I also need to extract the elements of DATA2 for each NAME and put them in a list so I can traverse them individually, e.g. get the elements 8,9 for NAME b and put them into a list. (Note that the third record has a space in the list in DATA2 between the 7 and the comma.)
How can I do both operations? I know that I can use read.table and $ for accessing individual elements, but I am stuck.
info<-read.table("table1", header=FALSE,sep="\t")
namelist<-list(info$NAME)
Run this demo and look at the structure of n, d1, and d2 -- that should help you get going:
df = read.table(text="NAME\tDATA1\tDATA2
a\t10\t1,2,3
b\t6\t8,9
c\t20\t5,6,7 ,8",
header= TRUE,
stringsAsFactors=FALSE,
sep='\t')
n = df$NAME
d1 = df$DATA1
d2 = lapply(strsplit(df$DATA2, ","),
as.numeric)
names(d2) = n
d2[['b']][1] # first value of the element named 'b' (d2['b'] would return a list of length 1)
lapply(d2, FUN=mean) # mean of each element of d2
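The difference between single and double brackets on a list, and a simple way to traverse every value, can be sketched like this. d2 is rebuilt by hand here with the values from the question so the snippet runs on its own.

```r
# Same content as d2 in the demo above, written out by hand.
d2 <- list(a = c(1, 2, 3), b = c(8, 9), c = c(5, 6, 7, 8))

single <- d2["b"]        # a list of length 1 that still wraps the vector
inner <- d2[["b"]]       # the numeric vector itself
first_b <- d2[["b"]][1]  # first value for NAME b

# Traverse every value of every NAME, one by one.
total <- 0
for (nm in names(d2)) {
  for (value in d2[[nm]]) {
    total <- total + value
  }
}

# The stray space in "7 " is tolerated by as.numeric().
seven <- as.numeric("7 ")
```

The same [[ ]] access works for DATA1, which is already a plain vector after read.table and can be looped over directly.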

Resources