R: combine two csv files with spark - r

I have two very large csv files and I'm using spark with R. My first file was uploaded this way:
data <- spark_read_csv(sc, "D:/my_file.csv")
After working with first file I have these variables:
Name | Number
The second csv file that has these variables:
Name | Number | Surname
You can also see that the second file has one more variable than the first. I would like to ignore the Surname column of the second file when loading with spark. How can I combine the two files so that the second is the continuum of the first?

From what I gather, you want to get rid of the Surname column in your second dataframe and make a union with the first.
spark_read_csv seems to come from sparklyr that I have never used but in plain SparkR, we could read data like below. I am pretty sure that the rest of the code would work the same way, regardless of the way the data is read.
> d1 = read.df(".../f1.csv", "csv", header="true")
> head(d1)
Name Number
1 x 7
2 y 8
> d2 = read.df(".../f2.csv", "csv", header="true")
> head(d2)
Name Number Surname
1 z 5 zz
2 w 6 ww
Then, it is pretty straightforward:
> trimmed_d2 = select(d2, "Name", "Number")
> all_the_data = union(d1, trimmed_d2)
> head(all_the_data)
Name Number
1 x 7
2 y 8
3 z 5
4 w 6

Related

How can I read a double-semicolon-separated .txt in r?

I have this problem but in r:
How can I read a double-semicolon-separated .csv with quoted values using pandas?
The solution there is to drop the additional columns generated. I'd like to know if there's a way to read the file separated by ;; without generating those addiotional columns.
Thanks!
Read it in normally using read.csv2 (or whichever variant you prefer, including read.table, read.delim, readr::read_csv2, data.table::fread, etc), and then remove the even-numbered columns.
dat <- read.csv2(text = "a;;b;;c;;d\n1;;2;;3;;4")
dat
# a X b X.1 c X.2 d
# 1 1 NA 2 NA 3 NA 4
dat[,-seq(2, ncol(dat), by = 2)]
# a b c d
# 1 1 2 3 4
It is usually recommended to properly clean your data before attempting to parse it, instead of cleaning it WHILE parsing, or worse, AFTER. Either use Notepad++ to Replace all ;; occurences or R itself, but do not delete the original files (also a rule of thumb - never delete sources of data).
my.text <- readLines('d:/tmp/readdelim-r.csv')
cleaned <- gsub(';;', ';', my.text)
writeLines(cleaned, 'd:/tmp/cleaned.csv')
my.cleaned <- read.delim('d:/tmp/cleaned.csv', header=FALSE, sep=';')

Removing specific columns from multiple data frames (.tab) and then merging them in R

I have 24 ".tab" files in a folder with names file1.tab, file2.tab, ..... file24.tab. Each of the files is a dataframe with 4 columns and 50,000 rows: The file looks like the image attached-
This is how each of the dataframe file looks like.
The first column is same in all the 24 files, but columns 2,3 and 4 have different values in each of the 24 files. For me, the columns 3 and 4 of each dataframe are irrelevant. I can get rid of the columns in each dataframe individually by following steps :
filenames <- Sys.gob("*.tab") #reads all the 24 file names
dataframe1 <- read.tab(filenames[1])
dataframe1 <- dataframe1[, -c(3,4)] #removes 3rd and 4th column of dataframe
However, this becomes very hectic when I have to repeat the above operation individually on 24 (or more) files which are similar. Is there a way to perform the above operation i.e. removing 3rd and 4th columns from all the 24 files by one code ?
Second part:
After removing the 3rd and 4th columns from each of the 24 files, I want to create a new dataframe which has 25 columns, such that the first column is the Column1 (which is same in all the files) and the subsequent columns are column2 from each of the files.
For two dataframes df1 and df2, I use :
merge(df1,df2,1,1)
and it creates a new data frame. It would be extremely tedious to do the merge operation individually for 24 modified dataframes. Could you please help me?
PS - I tried to find answers to any similar question (if asked before) and could not find it. So, in case it is marked as duplicate, it would be very kind if you please put a link to where it has been answered.
I have just started learning R and have no prior experience.
Regards,
Kshitij
First lets make a list of fake files
fakefile <- 'a\tb\tc\td
1\t2\t3\t4'
# In your case instead oof the string it would be the name of the file,
# and therefore it would not have the `text` argument
str(read.table(text = fakefile, header = TRUE))
## 'data.frame': 1 obs. of 4 variables:
## $ a: int 1
## $ b: int 2
## $ c: int 3
## $ d: int 4
# This list would be analogous to your `filenames` list
fakefile_list <- rep(fakefile, 20)
str(fakefile_list)
## chr [1:20] "a\tb\tc\td\n1\t2\t3\t4" "a\tb\tc\td\n1\t2\t3\t4" ...
In principle, all solutions will have the same underlying work as a list
and then merge concept (although the merge might be different here and there).
Solution 1 - If you can rely on the order of column 1
If you can rely on the ordering of the columns, then you dont really need to
read columns 1 and 4 of each file, but just col 4 and bind them.
# Reading column 1 once....
col1 <- read.table(text = fakefile_list[1], header = TRUE)[,1]
# Reading cols 4 in all files
# We first make a function that does our tasks (reading and removing cols)
reader_fun <- function(x) {
read.table(text = x, header = TRUE)[,4]
}
# Then we use lapply to use that function on each elment of our list
cols4 <- lapply(fakefile_list, FUN = reader_fun)
str(cols4)
## List of 20
## $ : int 4
## $ : int 4
## $ : int 4
## $ : int 4
# Then we use do.call and cbind to merge all of them as a matrix
cols4_mat <- do.call(cbind, cols4)
# And finally add column 1 to it
data.frame(col1, cols4_mat)
## col1 X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19
## 1 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## X20
## 1 4
Solution 2 - If you can not rely in the order
The implementation is easier but it is a lot slower in most situations
# In your case it would be like this ...
# lapply(fakefile_list, FUN = function(x) read.table(x)[, c(1,4)], header = TRUE)
# But since im passing text and not file names ...
my_contents <- lapply(fakefile_list, FUN = function(x, ...) read.table(text = x, ...)[, c(1,4)], header = TRUE)
# And now we use full join and Reduce to merge everything
Reduce(function(x,y) dplyr::full_join(x,y, by = 'a') , my_contents)
## a d.x d.y d.x.x d.y.y d.x.x.x d.y.y.y d.x.x.x.x d.y.y.y.y d.x.x.x.x.x
## 1 1 4 4 4 4 4 4 4 4 4
## d.y.y.y.y.y d.x.x.x.x.x.x d.y.y.y.y.y.y d.x.x.x.x.x.x.x d.y.y.y.y.y.y.y
## 1 4 4 4 4 4
## d.x.x.x.x.x.x.x.x d.y.y.y.y.y.y.y.y d.x.x.x.x.x.x.x.x.x
## 1 4 4 4
## d.y.y.y.y.y.y.y.y.y d.x.x.x.x.x.x.x.x.x.x d.y.y.y.y.y.y.y.y.y.y
## 1 4 4 4
# you will need to modify the column names btw ...
Bonus - And the most concise solution ...
Depending on how big your data sets are, you might want to ignore the extra
columns from the start (instead of reading them and then removing them).
You can use fread from the data.table package to do that for you.
reader_function <- function(x) {
data.table::fread(x, select = c(1,4))
}
my_contents <- lapply(fakefile_list, FUN = reader_function)
Reduce(function(x,y) dplyr::full_join(x,y, by = 'a') , my_contents)
While the answer above by Sebastian worked perfectly fine, I myself figured out another way to solve the above question using the for-loop. So, I am sharing that solution in case anyone else has similar question and feels comfortable using this method.
First of all, I set the working directory to the folder which contains the files. This is done using setwd() command.
setwd("/absolute path to the folder containing files/") #set working directory to the folder containing files
Now, I define the path to the files so that I can list the files.
path <- "/absolute path to the folder containing files/" #define the path to the folder
I create the list of filenames that I am interested in.
filenames<- dir(path, "*.tab") #List the files in the folder
Now, I create a new file with the Column 1 and Column 2 of the first file by following code
out_file<- read.table(filenames[1])[,c(1:2)] #create an output file with column1 and column2 of the first file
I write a for-loop that now reads only the second column of the files 2 to 24, and adds this second column from each of the files to the out_file defined above.
for(i in 2:length(filenames)){ #iterates from the second file as the first 2 columns of the first file has already been assigned to out_file
file<-read.table(filenames[i], header=FALSE, stringsAsFactors= FALSE) #reads files
out_file<- cbind(out_file, file[,2]) #adds second column of each file
}
What the above code actually does is that it iterates through each of the files, extracts the column 2 and adds it to the out_file, thereby creating the file of my interest.

Filter rows based on ID over multiple data frames with for loop

How can I filter 180 .csv files from my global directory based on a matching ID in another df named 'Camera' in R? When I tried to incorporate my one by one file filtering code (see step 3b) into a for-loop (see step 3a) I get the error:
Error in paste("i")$SegmentID : $ operator is invalid for atomic vectors.
I'm quite new to for loop functions, so I really appreciate your help! All the 180 files have a unique name, are different in length, but have the same column structure & names. They look like:
df 'File1' df 'Camera'
ID Speed Location ID Time
1 30 4 1 10
2 35 5 3 11
3 40 6 5 12
4 30 7
5 35 8
Filtered df 'File1'
ID Speed Location
1 30 4
3 40 6
5 35 8
These are some samples of my code:
#STEP 1: read files
filenames <- list.files(path="06-06-2017_0900-1200uur",
pattern="*.csv")
# STEP 2: import files
for(i in filenames){
filepath <- file.path("06-06-2017_0900-1200uur",paste(i))
assign(i, read.csv2(filepath, header = TRUE, skip = "1"))
}
# STEP 3a: delete rows that do not match ID in df 'Cameras'
for(i in filesnames){
paste("i") <- paste("i")[paste("i")$ID %in% Cameras$ID,]
}
#STEP 3b: filtering one by one
File1 <- File1[File1$ID %in% Camera$ID,]
Here is an approach that makes use of lists (generally a better way to go). First, utilize the include.names argument in list.files():
fns <- list.files(
path = "06-06-2017_0900-1200uur",
pattern = "*.csv",
include.names = T
)
Now you have a list of your filenames. Next, apply read.csv2 to each of the filenames in your list:
dat <- lapply(fns, read.csv2, header = T, skip = 1)
Now you have a list of data frames (the output from calling read.csv). Finally, apply subset() to each of the data frames to keep only those rows which match the ID column:
out <- lapply(dat, function(x) subset(x, ID %in% Camera$ID))
If I understand the question, the output should be a data frame from file1 where the ID for all rows matches one of the rows in the Camera file.
This is easily accomplished with the sqldf() package and structured query language.
rawFile1 <- "ID Speed Location
1 30 4
2 35 5
3 40 6
4 30 7
5 35 8
"
rawCamera <- " ID Time
1 10
3 11
5 12
"
file1 <- read.table(textConnection(rawFile1),header=TRUE)
Camera <- read.table(textConnection(rawCamera),header=TRUE)
library(sqldf)
sqlStmt <- "select * from file1 where ID in(select ID from Camera)"
sqldf(sqlStmt,drv="SQLite")
...and the output:
ID Speed Location
1 1 30 4
2 3 40 6
3 5 35 8
>
To extend this logic to a number of csv files, first we obtain the list of files from the subdirectory where they are stored using the list.files() function. For example, if the files were in a data subdirectory of the R working directory, one might use the following function call.
theFiles <- list.files("./data/",".csv",full.names=TRUE)
We can read these files with read.table() to create a list() of data frames.
theData <- lapply(theFiles,function(x) {
read.table(x,header=TRUE)})
To combine the files into a single data frame, we execute do.call().
combinedData <- do.call(rbind,theData)
Now we can read the camera data and use sqldf to keep only the IDs matching the camera data.
Camera <- read.table(...,header=TRUE)
library(sqldf)
sqlStmt <- "select * from combinedData where ID in(select ID from Camera)"
sqldf(sqlStmt,drv="SQLite")

Read only lines from a large text file which fulfill specific condition

I have a large file (data.txt, 35 GB) which has 3 columns.
Some example part of the file would look like the following:
... ... ...
5 701565 8679.56
8 1.16201e+006 3193.18
1 1.16173e+006 4457.85
14 1.16173e+006 4457.85
9 1.77942e+006 7208.73
4 1.78011e+006 8239.88
14 1.78019e+006 8195.57
9 2.00206e+006 8858.55
4 2.00199e+006 7924
... ... ...
I want to plot a histogram for the 3rd column when the values in the second column are between 0 and 50'000.
Then I want to do another histogram where the values of the first column are between 50'000 and 100'000. And so on, and so forth.
I don't know how to load/read only the data I need at a time. Any help would be appreciated!
If I should use the sqldf package then the question I have would be how I can say that the value of the 2nd column should be smaller than a e.g. 50'000?
The difference to How do i read only lines that fulfil a condition from a csv into R? is that I don't have any column names. Therefore I cannot do what they propose in their solution:
sql = "select * from file where Sepal.Length > 5"
I think recent versions of readr support this sort of thing. The following is just adapted from the help for readr::read_csv_chunked
library(readr)
f <- function(x, pos) subset(x, X3 > 0 & X3 < 50000)
df <- read_csv_chunked(
'test.csv',
DataFrameCallback$new(f),
chunk_size = 100000,
col_names = F
)

merging multiple dataframes with duplicate rows in R

Relatively new with R for this kind of thing, searched quite a bit and couldn't find much that was helpful.
I have about 150 .csv files with 40,000 - 60,000 rows each and I am trying to merge 3 columns from each into 1 large data frame. I have a small script that extracts the 3 columns of interest ("id", "name" and "value") from each file and merges by "id" and "name" with the larger data frame "MergedData". Here is my code (I'm sure this is a very inefficient way of doing this and that's ok with me for now, but of course I'm open to better options!):
file_list <- list.files()
for (file in file_list){
if(!exists("MergedData")){
MergedData <- read.csv(file, skip=5)[ ,c("id", "name", "value")]
colnames(MergedData) <- c("id", "name", file)
}
else if(exists("MergedData")){
temp_data <- read.csv(file, skip=5)[ ,c("id", "name", "value")]
colnames(temp_data) <- c("id", "name", file)
MergedData <- merge(MergedData, temp_data, by=c("id", "name"), all=TRUE)
rm(temp_data)
}
}
Not every file has the same number of rows, though many rows are common to many files. I don't have an inclusive list of rows, so I included all=TRUE to append new rows that don't yet exist in the MergedData file.
My problem is: many of the files contain 2-4 rows with identical "id" and "name" entries, but different "value" entries. So, when I merge them I end up adding rows for every possible combination, which gets out of hand fast. Most frustrating is that none of these duplicates are of any interest to me whatsoever. Is there a simple way to take the value for the first entry and just ignore any further duplicate entries?
Thanks!
Based on your comment, we could stack each file and then cast the resulting data frame from "long" to "wide" format:
library(dplyr)
library(readr)
library(reshape2)
df = lapply(file_list, function(file) {
dat = read_csv(file)
dat$source.file = file
return(dat)
})
df = bind_rows(df)
df = dcast(df, id + name ~ source.file, value.var="value")
In the code above, after reading in each file, we add a new column source.file containing the file name (or a modified version thereof).* Then we use dcast to cast the data frame from "long" to "wide" format to create a separate column for the value from each file, with each new column taking one of the names we just created in source.file.
Note also that depending on what you're planning to do with this data frame, you may find it more convenient to keep it in long format (i.e., skip the dcast step) for further analysis.
Addendum: Dealing with Aggregation function missing: defaulting to length warning. This happens when you have more than one row with the same id, name and source.file. That means there are multiple values that have to get mapped to the same cell, resulting in aggregation. The default aggregation function is length (i.e., a count of the number of values in that cell). The only ways around this that I know of are (a) keep the data in long format, (b) use a different aggregation function (e.g., mean), or (c) add an extra counter column to differentiate cases with multiple values for the same combination of id, name, and source.file. We demonstrate these below.
First, let's create some fake data:
df = data.frame(id=rep(1:2,2),
name=rep(c("A","B"), 2),
source.file=rep(c("001","002"), each=2),
value=11:14)
df
id name source.file value
1 1 A 001 11
2 2 B 001 12
3 1 A 002 13
4 2 B 002 14
Only one value per combination of id, name and source.file, so dcast works as desired.
dcast(df, id + name ~ source.file, value.var="value")
id name 001 002
1 1 A 11 13
2 2 B 12 14
Add an additional row with the same id, name and source.file. Since there are now two values getting mapped to a single cell, dcast must aggregate. The default aggregation function is to provide a count of the number of values.
df = rbind(df, data.frame(id=1, name="A", source.file="002", value=50))
dcast(df, id + name ~ source.file, value.var="value")
Aggregation function missing: defaulting to length
id name 001 002
1 1 A 1 2
2 2 B 1 1
Instead, use mean as the aggregation function.
dcast(df, id + name ~ source.file, value.var="value", fun.aggregate=mean)
id name 001 002
1 1 A 11 31.5
2 2 B 12 14.0
Add a new counter column to differentiate cases where there are multiple rows with the same id, name and source.file and include that in dcast. This gets us back to a single value per cell, but at the expense of having more than one column for some source.files.
# Add counter column
df = df %>% group_by(id, name, source.file) %>%
mutate(counter=1:n())
As you can see, the counter value only has a value of 1 in cases where there's only one combination of id, name, and source.file, but has values of 1 and 2 for one case where there are two rows with the same id, name, and source.file (rows 3 and 5 below).
df
id name source.file value counter
1 1 A 001 11 1
2 2 B 001 12 1
3 1 A 002 13 1
4 2 B 002 14 1
5 1 A 002 50 2
Now we dcast with counter included, so we get two columns for source.file "002".
dcast(df, id + name ~ source.file + counter, value.var="value")
id name 001_1 002_1 002_2
1 1 A 11 13 50
2 2 B 12 14 NA
* I'm not sure what your file names look like, so you'll probably need to adjust this create a naming format with a unique file identifier. For example, if your file names follow the pattern "file001.csv", "file002.csv", etc., you could do this: dat$source.file = paste0("Value", gsub("file([0-9]{3})\\.csv", "\\1", file).

Resources