Create a list of one sub-column based on another column in R

I have a data set that looks like:
Files Batch
filepath1.txt One
filepath2.txt One
filepath3.txt One
filepath4.txt One
filepath5.txt two
filepath6.txt two
filepath7.txt two
filepath8.txt two
I want to loop over the full data set (which has a dozen "Batch" categories) by creating groups of "Files" based on which "Batch" they are in, stored in a new variable called "batch"
i.e.
batch[1]
filepath1.txt
filepath2.txt
filepath3.txt
filepath4.txt
batch[2]
filepath5.txt
filepath6.txt
filepath7.txt
filepath8.txt
How do I do this for all my Batch groups in the full data set?

The split function seems to be what you're looking for.
> dat <- data.frame(File = paste0("file", 1:10, ".txt"), Batch = rep(c("one", "two"), each = 5))
> dat
File Batch
1 file1.txt one
2 file2.txt one
3 file3.txt one
4 file4.txt one
5 file5.txt one
6 file6.txt two
7 file7.txt two
8 file8.txt two
9 file9.txt two
10 file10.txt two
> split(dat, dat$Batch)
$one
File Batch
1 file1.txt one
2 file2.txt one
3 file3.txt one
4 file4.txt one
5 file5.txt one
$two
File Batch
6 file6.txt two
7 file7.txt two
8 file8.txt two
9 file9.txt two
10 file10.txt two
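If you then want to loop over the groups, here is a minimal sketch building on the dat above (the question's column names are Files/Batch, the example's are File/Batch; adjust accordingly). The loop body is just a placeholder:
# Split only the file column, giving a named list of file vectors
batch <- split(dat$File, dat$Batch)
batch[["one"]]          # the files in the first batch, like batch[1] in the question
for (b in names(batch)) {
  files <- batch[[b]]
  # ... process the files of this batch here ...
}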

Related

R: combine two csv files with spark

I have two very large csv files and I'm using spark with R. My first file was uploaded this way:
data <- spark_read_csv(sc, "D:/my_file.csv")
After working with the first file I have these variables:
Name | Number
The second csv file has these variables:
Name | Number | Surname
You can also see that the second file has one more variable than the first. I would like to ignore the Surname column of the second file when loading with spark. How can I combine the two files so that the second is a continuation of the first?
From what I gather, you want to get rid of the Surname column in your second dataframe and make a union with the first.
spark_read_csv seems to come from sparklyr, which I have never used, but in plain SparkR we could read the data as below. I am pretty sure the rest of the code would work the same way regardless of how the data is read.
> d1 = read.df(".../f1.csv", "csv", header="true")
> head(d1)
Name Number
1 x 7
2 y 8
> d2 = read.df(".../f2.csv", "csv", header="true")
> head(d2)
Name Number Surname
1 z 5 zz
2 w 6 ww
Then, it is pretty straightforward:
> trimmed_d2 = select(d2, "Name", "Number")
> all_the_data = union(d1, trimmed_d2)
> head(all_the_data)
Name Number
1 x 7
2 y 8
3 z 5
4 w 6
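For completeness, the same idea in sparklyr itself would presumably look like the sketch below. It is untested and assumes the usual dplyr verbs that sparklyr translates to Spark SQL; the second file path is made up for illustration:
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

d1 <- spark_read_csv(sc, name = "d1", path = "D:/my_file.csv")
d2 <- spark_read_csv(sc, name = "d2", path = "D:/my_second_file.csv")  # hypothetical path

# Drop Surname, then append d2 below d1
all_the_data <- union_all(d1, select(d2, Name, Number))
head(all_the_data)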

Removing specific columns from multiple data frames (.tab) and then merging them in R

I have 24 ".tab" files in a folder with names file1.tab, file2.tab, ..... file24.tab. Each of the files is a dataframe with 4 columns and 50,000 rows: The file looks like the image attached-
This is how each of the dataframe file looks like.
The first column is same in all the 24 files, but columns 2,3 and 4 have different values in each of the 24 files. For me, the columns 3 and 4 of each dataframe are irrelevant. I can get rid of the columns in each dataframe individually by following steps :
filenames <- Sys.glob("*.tab")            # reads all 24 file names
dataframe1 <- read.delim(filenames[1])    # read the first tab-separated file
dataframe1 <- dataframe1[, -c(3,4)]       # removes 3rd and 4th columns of the data frame
However, this becomes very hectic when I have to repeat the above operation individually on 24 (or more) similar files. Is there a way to perform the above operation, i.e. removing the 3rd and 4th columns from all 24 files, with one piece of code?
Second part:
After removing the 3rd and 4th columns from each of the 24 files, I want to create a new dataframe which has 25 columns, such that the first column is Column1 (which is the same in all the files) and the subsequent columns are column2 from each of the files.
For two dataframes df1 and df2, I use :
merge(df1,df2,1,1)
and it creates a new data frame. It would be extremely tedious to do the merge operation individually for 24 modified dataframes. Could you please help me?
PS - I tried to find answers to any similar question (if asked before) and could not find it. So, in case it is marked as duplicate, it would be very kind if you please put a link to where it has been answered.
I have just started learning R and have no prior experience.
Regards,
Kshitij
First let's make a list of fake files:
fakefile <- 'a\tb\tc\td
1\t2\t3\t4'
# In your case, instead of the string it would be the name of the file,
# and therefore it would not have the `text` argument
str(read.table(text = fakefile, header = TRUE))
## 'data.frame': 1 obs. of 4 variables:
## $ a: int 1
## $ b: int 2
## $ c: int 3
## $ d: int 4
# This list would be analogous to your `filenames` list
fakefile_list <- rep(fakefile, 20)
str(fakefile_list)
## chr [1:20] "a\tb\tc\td\n1\t2\t3\t4" "a\tb\tc\td\n1\t2\t3\t4" ...
In principle, all solutions follow the same underlying read-into-a-list-and-then-merge concept
(although the merge step might differ here and there).
Solution 1 - If you can rely on the order of column 1
If you can rely on the ordering, then you don't really need to read column 1 from every file;
you can read it once, read only column 4 from each file, and bind them.
# Reading column 1 once....
col1 <- read.table(text = fakefile_list[1], header = TRUE)[,1]
# Reading cols 4 in all files
# We first make a function that does our tasks (reading and removing cols)
reader_fun <- function(x) {
  read.table(text = x, header = TRUE)[, 4]
}
# Then we use lapply to apply that function to each element of our list
cols4 <- lapply(fakefile_list, FUN = reader_fun)
str(cols4)
## List of 20
## $ : int 4
## $ : int 4
## $ : int 4
## $ : int 4
# Then we use do.call and cbind to merge all of them as a matrix
cols4_mat <- do.call(cbind, cols4)
# And finally add column 1 to it
data.frame(col1, cols4_mat)
## col1 X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19
## 1 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## X20
## 1 4
Solution 2 - If you cannot rely on the order
The implementation is easier, but it is a lot slower in most situations.
# In your case it would be like this ...
# lapply(fakefile_list, FUN = function(x) read.table(x)[, c(1,4)], header = TRUE)
# But since I'm passing text and not file names ...
my_contents <- lapply(fakefile_list, FUN = function(x, ...) read.table(text = x, ...)[, c(1,4)], header = TRUE)
# And now we use full join and Reduce to merge everything
Reduce(function(x,y) dplyr::full_join(x,y, by = 'a') , my_contents)
## a d.x d.y d.x.x d.y.y d.x.x.x d.y.y.y d.x.x.x.x d.y.y.y.y d.x.x.x.x.x
## 1 1 4 4 4 4 4 4 4 4 4
## d.y.y.y.y.y d.x.x.x.x.x.x d.y.y.y.y.y.y d.x.x.x.x.x.x.x d.y.y.y.y.y.y.y
## 1 4 4 4 4 4
## d.x.x.x.x.x.x.x.x d.y.y.y.y.y.y.y.y d.x.x.x.x.x.x.x.x.x
## 1 4 4 4
## d.y.y.y.y.y.y.y.y.y d.x.x.x.x.x.x.x.x.x.x d.y.y.y.y.y.y.y.y.y.y
## 1 4 4 4
# you will need to modify the column names btw ...
Bonus - And the most concise solution ...
Depending on how big your data sets are, you might want to ignore the extra
columns from the start (instead of reading them and then removing them).
You can use fread from the data.table package to do that for you.
reader_function <- function(x) {
  data.table::fread(x, select = c(1,4))
}
my_contents <- lapply(fakefile_list, FUN = reader_function)
Reduce(function(x,y) dplyr::full_join(x,y, by = 'a') , my_contents)
While the answer above by Sebastian worked perfectly fine, I figured out another way to solve the question using a for loop. I am sharing that solution in case anyone else has a similar question and feels more comfortable with this method.
First of all, I set the working directory to the folder which contains the files. This is done with the setwd() command.
setwd("/absolute path to the folder containing files/") #set working directory to the folder containing files
Now, I define the path to the files so that I can list the files.
path <- "/absolute path to the folder containing files/" #define the path to the folder
I create the list of filenames that I am interested in.
filenames<- dir(path, "*.tab") #List the files in the folder
Now I create a new data frame with Column 1 and Column 2 of the first file using the following code:
out_file<- read.table(filenames[1])[,c(1:2)] #create an output file with column1 and column2 of the first file
I write a for loop that reads only the second column of files 2 to 24 and adds this second column from each file to the out_file defined above.
for(i in 2:length(filenames)){  # start at the second file; the first file's columns are already in out_file
  file <- read.table(filenames[i], header = FALSE, stringsAsFactors = FALSE)  # read file i
  out_file <- cbind(out_file, file[, 2])  # add its second column
}
The above code iterates through each of the files, extracts column 2 and adds it to out_file, producing the data frame I was after.
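One detail the loop leaves open is the column names: out_file ends up with generic names. A small hedged follow-up, assuming you want the merged columns labelled after the files they came from ("Column1" is just a placeholder for whatever the shared first column represents):
# Label the merged columns after their source files (drop the .tab extension)
colnames(out_file) <- c("Column1", sub("\\.tab$", "", filenames))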

Filter rows based on ID over multiple data frames with for loop

How can I filter 180 .csv files from my global directory based on a matching ID in another df named 'Camera' in R? When I tried to incorporate my one by one file filtering code (see step 3b) into a for-loop (see step 3a) I get the error:
Error in paste("i")$SegmentID : $ operator is invalid for atomic vectors.
I'm quite new to for loop functions, so I really appreciate your help! All the 180 files have a unique name, are different in length, but have the same column structure & names. They look like:
df 'File1'
ID Speed Location
1 30 4
2 35 5
3 40 6
4 30 7
5 35 8
df 'Camera'
ID Time
1 10
3 11
5 12
Filtered df 'File1'
ID Speed Location
1 30 4
3 40 6
5 35 8
These are some samples of my code:
#STEP 1: read files
filenames <- list.files(path = "06-06-2017_0900-1200uur",
                        pattern = "*.csv")
# STEP 2: import files
for(i in filenames){
  filepath <- file.path("06-06-2017_0900-1200uur", paste(i))
  assign(i, read.csv2(filepath, header = TRUE, skip = "1"))
}
# STEP 3a: delete rows that do not match ID in df 'Cameras'
for(i in filesnames){
  paste("i") <- paste("i")[paste("i")$ID %in% Cameras$ID,]
}
#STEP 3b: filtering one by one
File1 <- File1[File1$ID %in% Camera$ID,]
Here is an approach that makes use of lists (generally a better way to go). First, use the full.names argument in list.files():
fns <- list.files(
  path = "06-06-2017_0900-1200uur",
  pattern = "*.csv",
  full.names = TRUE
)
Now you have a character vector of your filenames. Next, apply read.csv2 to each of the filenames:
dat <- lapply(fns, read.csv2, header = T, skip = 1)
Now you have a list of data frames (the output from calling read.csv2). Finally, apply subset() to each of the data frames to keep only those rows whose ID appears in Camera$ID:
out <- lapply(dat, function(x) subset(x, ID %in% Camera$ID))
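If it helps to keep track of which filtered frame came from which file, or to collapse the result into a single data frame, a small hedged follow-up on the list produced above:
names(out) <- basename(fns)       # label each filtered data frame by its file name
combined <- do.call(rbind, out)   # or stack them all into one data frame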
If I understand the question, the output should be a data frame from file1 where the ID for all rows matches one of the rows in the Camera file.
This is easily accomplished with the sqldf package and structured query language (SQL).
rawFile1 <- "ID Speed Location
1 30 4
2 35 5
3 40 6
4 30 7
5 35 8
"
rawCamera <- " ID Time
1 10
3 11
5 12
"
file1 <- read.table(textConnection(rawFile1),header=TRUE)
Camera <- read.table(textConnection(rawCamera),header=TRUE)
library(sqldf)
sqlStmt <- "select * from file1 where ID in(select ID from Camera)"
sqldf(sqlStmt,drv="SQLite")
...and the output:
ID Speed Location
1 1 30 4
2 3 40 6
3 5 35 8
To extend this logic to a number of csv files, first we obtain the list of files from the subdirectory where they are stored using the list.files() function. For example, if the files were in a data subdirectory of the R working directory, one might use the following function call.
theFiles <- list.files("./data/",".csv",full.names=TRUE)
We can read these files with read.table() to create a list() of data frames.
theData <- lapply(theFiles, function(x) {
  read.table(x, header = TRUE)})
To combine the files into a single data frame, we execute do.call().
combinedData <- do.call(rbind,theData)
Now we can read the camera data and use sqldf to keep only the IDs matching the camera data.
Camera <- read.table(...,header=TRUE)
library(sqldf)
sqlStmt <- "select * from combinedData where ID in(select ID from Camera)"
sqldf(sqlStmt,drv="SQLite")

Search and term variables by character string

I have a possibly simple problem, but I can't solve it.
I have two lists. List A is empty and list B has several named columns. Now, I want to select a column of B by a variable and put it in list A, somewhat like shown in the example:
A<-list()
B<-list()
VAR<-"a"
B$a<-c(1:10)
B$b<-c(10:20)
B$c<-c(20:30)
#This of course doesn't work...
A$VAR<-B$VAR
You can extract a list entry with B[[VAR]] and append a new entry to a list using get (A[[get("VAR")]] <- newEntry); since get("VAR") simply returns the value of VAR, plain A[[VAR]] <- B[[VAR]] works as well:
A[[get("VAR")]] <- B[[VAR]]
## A list
# $a
# [1] 1 2 3 4 5 6 7 8 9 10
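If several names need to be copied over, the same idiom extends naturally; a minimal sketch assuming a character vector of the names you want (vars is made up here):
vars <- c("a", "c")              # hypothetical set of names to copy
A[vars] <- B[vars]               # copies the named entries in one go
# or element by element:
for (v in vars) A[[v]] <- B[[v]]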

Using R to list and mark multiple csv files with characters from the title of those files, and put those in a dataframe

I have a large number of files that are all numbered and labeled from a CTD cast. These files all contain 3 columns, for bottle number fired, Depth, and Conductivity, and 3 rows, one for each water bottle fired.
1,68.93,0.2123
2,14.28,0.3139
3,8.683,0.3547
These files are named after the cast number, e.g. "OS1505xxx.csv", where xxx is the cast number. I would like to take the data from multiple casts, label the data with the cast number (which I presume would go in another column for each bottle sample), and then merge that data together into one dataframe.
1,68.93,0.2123,001
2,14.28,0.3139,001
3,8.683,0.3547,001
1,109.5,0.2062,002
2,27.98,0.4842,002
3,5.277,0.3705,002
One other thing: some files only have 1 or 2 bottles fired, while others have 4 bottles fired. I tried finding the files with only 3 rows, making a list of the filenames repeated three times, and then merging that with the bound csv files that had three rows into a dataframe, but I am very new to R and couldn't figure it out. Any help is appreciated.
This gets all of them into one data frame in order (001-100), and from there you can export it however you want.
df <- data.frame(matrix(ncol = 4, nrow = 0))  # start with zero rows so no NA row is left at the top
colnames(df) <- c("V1", "V2", "V3", "file")
for(i in 1:100) {
  file_name <- paste("OS1505", as.name(sprintf("%03d", i)), ".csv", sep = "")
  if(file.exists(file_name)) {
    print("match found")
    df_tmp <- read.csv(file_name, header = FALSE, sep = ",", fill = TRUE)
    df_tmp$file <- sprintf("%03d", i)
    df <- rbind(df, df_tmp)
  }
}
Try this:
files <- list.files(pattern="OS1505")
lst <- lapply(files, read.csv)
ids <- substr(files, 7,9)
for(i in 1:length(lst)) lst[[i]][,4] <- ids[i]
do.call(rbind, lst)
# X V1 V2 V3
#1 1 1 68.930 001
#2 2 2 14.280 001
#3 3 3 8.683 001
#4 1 1 109.500 002
#5 2 2 27.980 002
#6 3 3 5.277 002
To test this, I first created two dummy data sets and saved them as csv files, named to match your files (i.e. "OS1505001.csv" and "OS1505002.csv"):
file1 <- read.table(text="
1,68.93,0.2123
2,14.28,0.3139
3,8.683,0.3547", sep=',')
file2 <- read.table(text="
1,109.5,0.2062
2,27.98,0.4842
3,5.277,0.3705", sep=',')
write.csv(file1, "OS1505001.csv")
write.csv(file2, "OS1505002.csv")
Going through the code: files checks the directory for any files that have OS1505 in their name; two files match, "OS1505001.csv" and "OS1505002.csv". We bring those files into R with read.csv, wrapped in lapply so the read happens for every file in the files vector, and save the result in a list called lst. ids grabs the id numbers from the file names. In a for loop we assign each id to the 4th column of the corresponding data frame. Lastly, do.call brings it all together with rbind.
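Since the real cast files described in the question have no header row (unlike the write.csv demo above, which also adds a row-name column), a hedged variant that reads them headerless and appends the cast number without touching the three data columns might look like this:
files <- list.files(pattern = "^OS1505.*\\.csv$")
lst <- lapply(files, read.csv, header = FALSE)   # bottle, depth, conductivity
ids <- substr(files, 7, 9)                       # the xxx cast number
lst <- Map(function(d, id) { d$cast <- id; d }, lst, ids)
combined <- do.call(rbind, lst)                  # works even if casts have 1, 2 or 4 bottles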
