I have 10 data files in my current directory such as data-01, data-02, data-03, data-04 till data-10.
Each of these data files has a few hundred rows with 4 fields. I would like to add a new column name "ID" and keep its ID like 01 (for data file data-01) for all the rows in that file.
A base R solution using a loop would go like this:
df<- c()
for (x in list.files(pattern="*.csv")) {
u<-read.table(x)
u$Label = factor(x)
df <- rbind(df, u)
cat(x, "\n ")
}
This depends on your data files having the same number of columns (though you get get around that inside the loop by selecting which columns you need before rbind) and then you can set whichever filetype you are looking at. The cat is useful because you can better trace read problems (because there are always problems). I bet there is a better way to do this with apply as well.
Related
I am working on creating a data frame column summaries for multiple columns in R using SQL (sqldf). The reason I went with this method instead of doing it directly in SQL is because there was some additional plotting I had to do and my manager wanted me to keep all code in one file.
This is the basic example of how the syntax for sqldf looks like:
Result1 <- sqldf("
SELECT ColumnName,
COUNT(ColumnName) as Total,
FROM AnalysisFile
GROUP BY ColumnName")
Overall, the way I completed the task is: 1) import the file, 2) run 6-7 separate queries for the columns of interest, 3) write all of them as a single .xlsx file using this code:
column_names <- list('Result1' = Result1, 'Result2' = Result2)
path_out = "C:\Users\Desktop"
write.xlsx(column_names, file = paste(path_out, 'SummaryResults', Sys.Date(), '.xlsx'))
Everything works fine, but here is the challenge: my manager wants me to create a FOR loop that only uses a vector with column names as an input, so that we won't have to change the query/output code every time we want to run this process for different columns or documents.
This means, I have to create a loop that will iterate through column names, appending them to different places in the sqldf string, creating multiple data frames, then concatenating them together into a single .xlsx file for sharing with other departments.
This is an example of what I attempted (I'm more used to Python and only have 1 year of coding experience in general, so forgive me for terrible syntax):
ColumnNames <- c('Column1','Column2')
for (i in ColumnNames) {
( "
SELECT " + i + "
COUNT(" + i + ") as Total,
ROUND(count(" + i + ") * 100.0/SUM(COUNT("+ i +")) over(), 2) as Total_Percentage
from AnalysisFile
group by " + i+"
)
}
This obviously gives me multiple errors and I am not sure how to approach this issue, let alone also automate the output by iterating through column names in the next chunk of code. If there is a better way to achieve the result, this will work as well.
Any help is appreciated!
Currently, I am working with a long list of files.
They have a name pattern of SB_xxx_(parts). (different extensions), where xxx refers to an item code.
SB_19842.png
SB_19842_head.png
SB_19842_hand.png
SB_19842_head.pdf
...
It is found that many of these codes have incorrect entries.
I got two columns in hand: One is for old codes and one is new codes (let's say A & B). I hope to change all those old codes in the file names to the new code.
old new
12154 24124
92482 02425
.....
My first thought is to use file.rename()
However, it is a one-to-one changing approach. I cannot do this because every item has a different number of parts and different file extensions.
Is there any recursive method that can simply change all incorrect file names with strings in A and replace them with strings in B? Anyone get an idea, please?
A loop solution with purrr::map2 at the end:
library(purrr)
#create files to rename
file.create("SB_19842.png")
file.create("SB_19842_head.png")
file.create("SB_19842_hand.png")
file.create("SB_19842_head.pdf")
file.create("SB_12154.png")
file.create("SB_12154_head.png")
file.create("SB_12154_hand.png")
file.create("SB_12154_head.pdf")
# a dataframe with old a nd new patterns
file_names <- data.frame(
old = c("19842", "12154"),
new = c("new1", "new2")
)
# old filenames from the directory, specify path if needed
file_names_SB <- list.files(pattern = "SB_")
# function to substitute one type of code with another
sub_one_code <- function(old_code, new_code, file_names_original){
gsub(paste0("SB_", old_code), paste0("SB_", new_code), file_names_original)
}
# loop to substitute all codes
new_file_names <- file_names_SB
for (row in 1:nrow(file_names)){
new_file_names <- sub_one_code(file_names[row, "old"], file_names[row, "new"], new_file_names)
}
# rename all the files
map2(file_names_SB,
new_file_names,
file.rename)
#thelatemail provided a link with more elegant solutions for generating new file names.
It's me the newbie again with another messy file and folder situation(thanks to us biologiests): I got this directory containing a huge amount of .txt files (~900,000+), all the files have been previously handed with inconsistent naming format :(
For example, messy files in directory look like these:
ctrl_S978765_uns_dummy_00_none.txt
ctrl_S978765_3S_Cookie_00_none.txt
S59607_3S_goody_3M_V10.txt
ctrlnuc30-100_S3245678_DMSO_00_none.txt
ctrlRAP_S0846567_3S_Dex_none.txt
S6498432_2S_Fulra_30mM_V100.txt
.....
As you see the naming has no reliable consistency. What's important for me is the ID code embedded in them, such as S978765. Now I have got a list (100 ID codes) of these ID codes that I want.
The CSV file containing the list as below, mind you the list does have repetitive ID codes in the row due to different CLnumber value in the second columns:
ID code CLnumber
S978765 1
S978765 2
S306223 1
S897458 1
S514486 2
....
So I want to achieve below task: find all the messy named files using the code IDs by matching to my list. And copy them into a new directory.
I have thought of use list.files() to get all the .txt files and their names, then I got stuck at the next step at matching the code ID names, I know how to do it with one string, say "S978765", but if I do it one by one, this is almost just like manual digging the folder.
How could I feed the ID code names in column1 as a list and compare/match them with the messy file title names in the directory and then copy them into a new folder?
Many thanks,
ML
This works:
library(stringr)
# get this via list.files in your actual code
files <- c("ctrl_S978765_uns_dummy_00_none.txt",
"ctrl_S978765_3S_Cookie_00_none.txt",
"S59607_3S_goody_3M_V10.txt",
"ctrlnuc30-100_S3245678_DMSO_00_none.txt",
"ctrlRAP_S0846567_3S_Dex_none.txt",
"S6498432_2S_Fulra_30mM_V100.txt")
ids <- data.frame(`ID Code` = c("S978765", "S978765", "S306223", "S897458", "S514486"),
CLnumber = c(1, 2, 1, 1, 2),
stringsAsFactors = FALSE)
str_subset(files, paste(ids$ID.Code, collapse = "|"))
#> [1] "ctrl_S978765_uns_dummy_00_none.txt" "ctrl_S978765_3S_Cookie_00_none.txt"
str_subset takes a character vector and returns elements matching some pattern. In this case, the pattern is "S978765|S978765|S306223|S897458|S514486" (created by using paste), which is a regular expression that matches any of the ID codes separated by |. So we take files and keep only the elements that have a match in ID Code.
There are many other ways to do this, which may or may not be more clear. For example, you could pass ids$ID.Code directly to str_subset instead of constructing a regular expression via paste, but that would throw a warning about object lengths every time, which could get confusing (or cause problems if you get used to ignoring it and then ignore it in a different context where it matters). Another method would be to use purrr and keep, but while that might be a little bit more clear to write, it would be a lot more inefficient since it would mean making multiple passes over the files vector -- not relevant in this context, but possibly very relevant if you suddenly need to do this for hundreds of thousands of files and IDs.
You could use regex to extract the ID codes from the file name.
Here, I have used the pattern "S" followed by 5 or more numbers. Once we extract the ID_codes, we can compare them with the ones which we have in csv.
Assuming the csv is called df and the column name is ID_Codes we can use %in% to filter them.
We can then use file.copy to move files from one folder to another folder.
all_files <- list.files(path = '/Path/To/Folder', full.names = TRUE)
selected_files <- all_files[sub('.*(S\\d{5,}).*', '\\1', basename(all_files))
%in% unique(df$ID_Codes)]
file.copy(selected_files, 'new_path/for/files')
I have variables that are named team.1, team.2, team.3, and so forth.
First of all, I would like to know how to go through each of these and assign a data frame to each one. So team.1 would have data from one team, then team.2 would have data from a second team. I am trying to do this for about 30 teams, so instead of typing the code out 30 times, is there a way to cycle through each with a counter or something similar?
I have tried things like
vars = list(sprintf("team.x%s", 1:33)))
to create my variables, but then I have no luck assigning anything to them.
Along those same lines, I would like to be able to run a function I made for cleaning and sorting the individual data sets on all of them at once.
For this, I have tried a for loop
for (j in 1:33) {
assign(paste("team.",j, sep = ""), cleaning1(paste("team.",j, sep =""), j))
}
where cleaning1 is my function, with two calls.
cleaning1(team.1, 1)
This produces the error message
Error in who[, -1] : incorrect number of dimensions
So obviously I am hoping the loop would count through my data sets, and also input my function calls and reassign my datasets with the newly cleaned data.
Is something like this possible? I am a complete newbie, so the more basic, the better.
Edit:
cleaning1:
cleaning1 = function (who, year) {
who[,-1]
who$SeasonEnd = rep(year, nrow(who))
who = (who[-nrow(who),])
who = tbl_df(who)
for (i in 1:nrow(who)) {
if ((str_sub(who$Team[i], -1)) == "*") {
who$Playoffs[i] = 1
} else {
who$Playoffs[i] = 0
}
}
who$Team = gsub("[[:punct:]]",'',who$Team)
who = who[c(27:28,2:26)]
return(who)
}
This works just fine when I run it on the data sets I have compiled myself.
To run it though, I have to go through and reassign each data set, like this:
team.1 = cleaning1(team.1, 1)
team.2 = cleaning1(team.2, 2)
So, I'm trying to find a way to automate that part of it.
I think your problem would be better solved by using a list of data frames instead of many variables containing one data frame each.
You do not say where you get your data from, so I am not sure how you would create the list. But assuming you have your data frames already stored in the variables team.1 etc., you could generate the list with
team.list <- list(team.1, team.2, ...,team.33)
where the dots stand for the variables that I did not write explicitly (you will have to do that). This is tedious, of course, and could be simplified as follows
team.list <- do.call(list,mget(paste0("team.",1:33)))
The paste0 command creates the variable names as strings, mget converts them to the actual objects, and do.call applies the list command to these objects.
Now that you have all your data in a list, it is much easier to apply a function on all of them. I am not quite sure how the year argument should be used, but from your example, I assume that it just runs from 1 to 33 (let me know, if this is not true and I'll change the code). So the following should work:
team.list.cleaned <- mapply(cleaning1,team.list,1:33)
It will go through all elements of team.list and 1:33 and apply the function cleaning1 with the elements as its arguments. The result will again be a list containing the output of each call, i.e.,
list( cleaning1(team.list[[1]],1), cleaning1(team.list[[2]],2), ...)
Since you are now to R I strongly recommend that you read the help on the apply commands (apply, lapply, tapply, mapply). There are very useful and once you got used to them, you will use them all the time...
There is probably also a simple way to directly generate the list of data frames using lapply. As an example: if the data frames are read in from files and you have the file names stored in a character vector file.names, then something along the lines of
team.list <- lapply(file.names,read.table)
might work.
I'm using base::paste in a for loop:
for (k in 1:length(summary$pro))
{
if (k == 1)
mp <- summary$pro[k]
else
mp <- paste(mp, summary$pro[k], sep = ",")
}
mp comes out as one big string, where the elements are separated by commas.
For example mp is "1,2,3,4,5,6"
Then, I want to put mp in a file, where each of its elements is added to a separate column in the same row. My code for this is:
write.table(mp, file = recompdatafile, sep = ",")
However, mp just appears in the CSV as one big string as opposed to being divided up. How can I achieve my desired format?
FYI
I've also tried converting mp to a list, and strsplit()-ing it, neither of which have worked.
Once I've added summary$pro to the file, how can I also add summary$me (which has the same format), in one row with multiple columns?
Thanks,
n.i.
If you want to write something to a file, write.table() isn't the only way. If you want to avoid headers and quotes and such, you can use the more direct cat. For example
cat(summary$pro, sep=",", file="filename.txt")
will write out the vector of values from summary$pro separated by commas more directly. You don't need to build a string first. (And building a string one element at a time as you did above is a bad practice anyway. Most functions in R can operate on an entire vector at a time, including paste).