How to partition a data frame into multiple .csv files based on whitespace rows? - r

I'm working with a dataset that has a timestamp, 3 numeric vectors, and a character vector.
Basically, each "set" of data is delineated by a new row. I need each series of rows to be saved as a .csv whenever a row shows every column as empty (x = \t\r\n). There are about 370 such sets in my dataset.
For example,
library(dplyr)
data <- data.frame(x1 = 1:4,
                   x2 = 4:1,
                   x3 = 3,
                   x4 = c("text", "no text", "example", "hello"))
new_row <- c("\t\r\n", "\t\r\n", "\t\r\n", "\t\r\n")
data1 <- rbind(data, new_row)
data2 <- data.frame(x1 = 1:4,
                    x2 = 4:1,
                    x3 = 4,
                    x4 = c("text", "no text", "example", "hello"))
data2 <- rbind(data2, new_row)
data3 <- rbind(data1, data2)
View(data3)
This is what my data set looks like (without the timestamp). I need every set of consecutive rows after a row full of \t\r\n to be exported as an individual .csv.
I'm doing text analysis. Each group of rows, with highly variable group size, represents a thread on a different subject. I need to analyze these individual threads.
What is the best way to go about doing this? I haven't had this problem before.

ind <- grepl("\t", data3$x4)
ind <- replace(cumsum(ind), ind, -1)
ind
# [1] 0 0 0 0 -1 1 1 1 1 -1
data4 <- split(data3, ind)
data4
# $`-1`
# x1 x2 x3 x4
# 5 \t\r\n \t\r\n \t\r\n \t\r\n
# 10 \t\r\n \t\r\n \t\r\n \t\r\n
# $`0`
# x1 x2 x3 x4
# 1 1 4 3 text
# 2 2 3 3 no text
# 3 3 2 3 example
# 4 4 1 3 hello
# $`1`
# x1 x2 x3 x4
# 6 1 4 4 text
# 7 2 3 4 no text
# 8 3 2 4 example
# 9 4 1 4 hello
The use of -1 is solely to keep the "\t\r\n" rows from being included in their respective groups; since cumsum(ind) starts at 0, -1 can never collide with a real group label. You can obviously drop that first frame :-)
From here, you can export with
data4 <- data4[-1]
ign <- Map(write.csv, data4, sprintf("file_%03d.csv", seq_along(data4)))
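If you prefer to drop the delimiter rows up front, a compact variant of the same idea (a sketch, assuming data3 as built above) is:
# Flag the "\t\r\n" delimiter rows, drop them, and split on the running
# group counter; row.names = FALSE keeps row indices out of the files.
is_delim <- grepl("\t", data3$x4)
groups <- split(data3[!is_delim, ], cumsum(is_delim)[!is_delim])
for (i in seq_along(groups)) {
  write.csv(groups[[i]], sprintf("file_%03d.csv", i), row.names = FALSE)
}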

Related

Sum Values of Every Column in Data Frame with Conditional For Loop

So I want to go through a data set and sum the values from each column based on the condition of my first column. The data and my code so far look like this:
x v1 v2 v3
1 0 1 5
2 4 2 10
3 5 3 15
4 1 4 20
for (i in colnames(data)) {
  if (data$x > 2) {
    x1 <- sum(data[[i]])
  } else {
    x2 <- sum(data[[i]])
  }
}
My assumption was that the for loop would call each column by name from the data and then sum the values in each column based on whether they matched the condition of column x.
I want to sum half the values from each column and assign them to a value x1 and do the same for the remainder, assigning it to x2. I keep getting an error saying the following:
the condition has length > 1 and only the first element will be used
What am I doing wrong and is there a better way to go about this? Ideally I want a table that looks like this:
v1 v2 v3
x1 6 7 35
x2 4 3 15
Here's a dplyr solution. First, I define the data frame.
df <- read.table(text = "x v1 v2 v3
1 0 1 5
2 4 2 10
3 5 3 15
4 1 4 20", header = TRUE)
# x v1 v2 v3
# 1 1 0 1 5
# 2 2 4 2 10
# 3 3 5 3 15
# 4 4 1 4 20
Then, I create a label (x_check) to indicate which group each row belongs to based on your criterion (x > 2), group by this label, and summarise each column with a v in its name using sum.
# Load library
library(dplyr)
df %>%
  mutate(x_check = ifelse(x > 2, "x1", "x2")) %>%
  group_by(x_check) %>%
  summarise_at(vars(contains("v")), funs(sum))
# # A tibble: 2 x 4
# x_check v1 v2 v3
# <chr> <int> <int> <int>
# 1 x1 6 7 35
# 2 x2 4 3 15
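Note that funs() is deprecated in more recent dplyr releases; on dplyr >= 1.0 the across() equivalent (same result, assuming the df above) would be:
df %>%
  mutate(x_check = ifelse(x > 2, "x1", "x2")) %>%
  group_by(x_check) %>%
  summarise(across(contains("v"), sum))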
Not sure if I understood your intention correctly, but here is how you would reproduce your results with base R:
df <- data.frame(
  x = 1:4,
  v1 = c(0, 4, 5, 1),
  v2 = 1:4,
  v3 = (1:4) * 5
)
x1 <- colSums(df[df$x > 2, 2:4, drop = FALSE])
x2 <- colSums(df[df$x <= 2, 2:4, drop = FALSE])
Where:
df[df$x > 2, 2:4, drop = FALSE] creates a subset of df whose rows satisfy df$x > 2 and whose columns are 2:4 (i.e., the second, third and fourth columns); drop = FALSE is there mainly to prevent R from simplifying the result in some special cases (e.g., when only one column is selected).
colSums then does a by-column sum on the subsetted data.frame.
If your x column was really a condition (e.g. a logical vector) you could just do
x1 <- colSums(df[df$x, 2:4, drop = FALSE])
x2 <- colSums(df[!df$x, 2:4, drop = FALSE])
Note that no loop is needed to get these results; in R you should use vectorized functions as much as possible.
More generally, you could do such aggregation with aggregate:
aggregate(df[, 2:4], by = list(condition = df$x <= 2), FUN = sum)
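For the example data above, that call should produce something like the following, where the condition column is the grouping logical (TRUE meaning x <= 2):
#   condition v1 v2 v3
# 1     FALSE  6  7 35
# 2      TRUE  4  3 15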

Anti Merging Large DataSets with Multiple Conditions

Suppose I have two data frames:
A <- data.frame(X1=c(1,2,3,4,5), X2=c(3,3,4,4,6), X3=c(3,2,14,5,4))
B <- data.frame(X1=c(1,3,5), X2=c(3,4,6))
I want to combine the two so that when X1 and X2 in A match a row in B, those entire rows (with all columns) are dropped from A; in other words, I want the rows of A that do not appear in B. I have tried anti_join and merge, but the results are not working as planned, and merge cannot handle the larger data frames. I have also tried things with the data.table package.
I would like the below dataframe to be returned or saved to a new object.
C <- data.frame(X1=c(2,4), X2=c(3,4), X3=c(2,5))
Wouldn't you just do A %>% anti_join(B, by = c("X1", "X2"))? That way by is set to both X1 and X2, and you get all the non-matching rows.
> library(dplyr)
> A <- data.frame(X1 = c(1, 2, 3, 4, 5), X2 = c(3, 3, 4, 4, 6), X3 = c(3, 2, 14, 5, 4))
> B <- data.frame(X1 = c(1, 3, 5), X2 = c(3, 4, 6))
> A %>% inner_join(B, by = c("X1", "X2"))
  X1 X2 X3
1  1  3  3
2  3  4 14
3  5  6  4
> A %>% anti_join(B, by = c("X1", "X2"))
  X1 X2 X3
1  2  3  2
2  4  4  5
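Since the question mentions large data and the data.table package, a data.table sketch of the same anti-join (assuming A and B as defined above) would be:
# A[!B, on = ...] keeps the rows of A with no matching (X1, X2) pair in B
library(data.table)
setDT(A)
setDT(B)
A[!B, on = c("X1", "X2")]
#    X1 X2 X3
# 1:  2  3  2
# 2:  4  4  5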

Combine two identical dataframe columns into comma-separated columns in R

I have two identically structured data frames (same number of rows and columns, and the same headers). What I would like to do is combine the two into one data frame that has comma-separated columns.
I know how to do it with the dummy data frames below, but using that approach on my own data would be very cumbersome.
These are my dummy data frames; the headers of my "real" data are "1", "2", "3", etc., while those of the dummy data frames are "X1", "X2", "X3", etc.
> data1
X1 X2 X3 X4
1 1 2 3 4
2 2 3 4 5
3 3 4 5 6
> data2
X1 X2 X3 X4
1 8 9 13 14
2 9 10 14 15
3 10 11 15 16
What I would like:
>data3
new1 new2 new3 new4
1 1,8 2,9 3,13 4,14
2 2,9 3,10 4,14 5,15
3 3,10 4,11 5,15 6,16
This is how I managed to get that output, but I think it is too cumbersome for a large dataset:
data1 <- data.frame('1' = 1:3, '2' = 2:4, '3' = 3:5, '4' = 4:6)
data2 <- data.frame('1' = 8:10, '2' = 9:11, '3' = 13:15, '4' = 14:16)
names(data1) <- c("1a", "2a", "3a", "4a")
names(data2) <- c("1b", "2b", "3b", "4b")
data3 <- cbind(data1, data2)
cols.1 <- c('1a', '1b'); cols.2 <- c('2a', '2b')
cols.3 <- c('3a', '3b'); cols.4 <- c('4a', '4b')
data3$new1 <- apply(data3[, cols.1], 1, paste, collapse = ",")
data3$new2 <- apply(data3[, cols.2], 1, paste, collapse = ",")
data3$new3 <- apply(data3[, cols.3], 1, paste, collapse = ",")
data3$new4 <- apply(data3[, cols.4], 1, paste, collapse = ",")
data3 <- data3[, c(9:12)]
Is there a way in which I can iterate this, perhaps with a for loop? Any help would be appreciated.
These posts are somewhat similar:
Same question, but for rows instead of columns:
how to convert column values into comma seperated row vlaues
Similar, but didn't work on my large dataset:
Paste multiple columns together
Using only base R:
data1 <- data.frame(x1 = 1:3, x2 = 2:4, x3 = 3:5, x4 = 4:6)
data2 <- data.frame(x1 = 8:10, x2 = 9:11, x3 = 13:15, x4 = 14:16)
# mapply pastes matching columns; it returns a character matrix,
# hence the as.data.frame() step afterwards
data3 <- mapply(function(x, y) paste(x, y, sep = ","), data1, data2)
data3 <- as.data.frame(data3)
x1 x2 x3 x4
1 1,8 2,9 3,13 4,14
2 2,9 3,10 4,14 5,15
3 3,10 4,11 5,15 6,16
Here's a basic for loop approach:
newdf = data.frame(matrix(ncol=ncol(data1),nrow=nrow(data1)))
for (i in 1:ncol(data1)) {
newdf[,i] = paste(data1[,i], data2[,i], sep=",")
}
#> newdf
# X1 X2 X3 X4
# 1 1,8 2,9 3,13 4,14
# 2 2,9 3,10 4,14 5,15
# 3 3,10 4,11 5,15 6,16
Line by line explanation:
initialize new empty dataframe of appropriate dimensions:
newdf = data.frame(matrix(ncol=ncol(data1),nrow=nrow(data1)))
loop through 1,2,..n columns and fill each column with the paste results:
for (i in 1:ncol(data1)) {
newdf[,i] = paste(data1[,i], data2[,i], sep=",")
}
Disclaimer: this may be very slow on large datasets; a dplyr or data.table approach (and perhaps some vapply()/sapply()/apply() statement) will be faster, if you are interested in learning those methods.
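If you do want to explore those, a purrr sketch of the same column-wise paste (assuming data1 and data2 as above) could look like this; map2_dfc() pastes matching columns and column-binds the results into a tibble:
library(purrr)
data3 <- map2_dfc(data1, data2, ~ paste(.x, .y, sep = ","))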

Merge in loop R

I am using a for loop to merge multiple files with another file:
library(data.table)
files <- list.files("path", pattern = ".TXT", ignore.case = TRUE)
for (i in seq_along(files)) {
  data <- fread(files[i], header = TRUE)
  # Merge
  mydata <- merge(mydata, data, by = "ID", all.x = TRUE)
  rm(data)
}
"mydata" looks as follows (simplified):
ID x1 x2
1 2 8
2 5 5
3 4 4
4 6 5
5 5 8
"data" looks as follows (around 600 files, in total 100GB). Example of 2 (seperate) files. Integrating all in 1 would be impossible (too large):
ID x3
1 8
2 4
ID x3
3 4
4 5
5 1
When I run my code I get the following dataset:
ID x1 x2 x3.x x3.y
1 2 8 8 NA
2 5 5 4 NA
3 4 4 NA 4
4 6 5 NA 5
5 5 8 NA 1
What I would like to get is:
ID x1 x2 x3
1 2 8 8
2 5 5 4
3 4 4 4
4 6 5 5
5 5 8 1
ID's are unique (never duplicates over the 600 files).
Any idea on how to achieve this as efficiently as possible would be much appreciated.
This is better suited as a comment, but I can't comment yet.
Wouldn't it be better to rbind instead of merge?
That seems to be what you want to accomplish.
Set the fill argument to TRUE to take care of differing column numbers:
library(data.table)
asd <- data.table(x1 = c(1, 2), x2 = c(4, 5))
a <- data.table(x2 = 5)
rbind(asd, a, fill = TRUE)
x1 x2
1: 1 4
2: 2 5
3: NA 5
Do this with data and then merge into mydata by ID.
Update for comment
files <- list.files("path", pattern = ".TXT", ignore.case = TRUE)
ff <- function(input) {
  data <- fread(input)
}
a <- lapply(files, ff)
library(plyr)
binded.data <- ldply(a, function(x) rbind(x, fill = TRUE))
So, this creates a function to read files and passes it to lapply; you get a list containing all your data files, each in its own data frame.
ldply from plyr then rbinds all the data frames into one.
Don't touch mydata yet.
binded.data <- data.table(binded.data, key = "ID")
Depending on your mydata you will perform different merge commands.
See:
https://rstudio-pubs-static.s3.amazonaws.com/52230_5ae0d25125b544caab32f75f0360e775.html
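As an aside, data.table's rbindlist() is a drop-in replacement for the ldply() step that is typically faster (a sketch, assuming a is the list produced by lapply above, plus one possible merge matching the all.x = TRUE intent of the original loop):
library(data.table)
# Bind all the per-file tables into one, filling missing columns with NA
binded.data <- rbindlist(a, fill = TRUE)
setkey(binded.data, ID)
mydata <- merge(mydata, binded.data, by = "ID", all.x = TRUE)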
Update 2
files <- list.files("path", pattern = ".TXT", ignore.case = TRUE)
ff <- function(input) {
  data <- fread(input)
  # This keeps only the rows of 'data' whose ID matches an ID in 'mydata'
  data <- data[ID %in% mydata[, ID]]
}
a <- lapply(files, ff)
library(plyr)
binded.data <- ldply(a, function(x) rbind(x, fill = TRUE))
Update 3
You can add cat to see which file the function is reading at any given moment, so you can tell after which file you run out of memory. That will point you toward how many files you can read in one go.
ff <- function(input) {
  # This will print the name of the file it is reading now
  cat(input, "\n")
  data <- fread(input)
  # This keeps only the rows of 'data' whose ID matches an ID in 'mydata'
  data <- data[ID %in% mydata[, ID]]
}

Create new data frame depending on the most extreme value in rows

I have the following data frame and I would like to create a new one that will be like the one below.
ID1 ID2 ID3 ID4
x1_X 0 10 4 7
x2_X 2 12 5 8
x3_X 3 1 3 5
y1_Y 4 13 6 4
y2_Y 5 14 1 9
y3_Y 2 11 1 5
y4_Y 1 1 2 3
z1_Z 1 0 0 5
z2_Z 3 6 7 7
New data frame
ID1 ID2 ID3 ID4
X x3 x2 x2 x2
Y y2 y2 y1 y2
Z z2 z2 z2 z2
Basically, the idea is the following:
For each ID column I want to find which of the row names (x1_X, x2_X, x3_X) has the most extreme value, and assign that subgroup name to X, since the row names encode subgroups.
My data frame is huge: 1700 columns and 100000 rows.
First we need to split the group and subgroup labels:
grp <- strsplit(row.names(df), "_")
And if performance is an issue, I think data.table is our best choice:
library(data.table)
df$group <- sapply(grp, "[", 2)
subgroup <- sapply(grp, "[", 1)
dt <- data.table(df)
And we now have access to the single line:
result <- dt[, lapply(.SD, function(x) subgroup[.I[which.max(x)]]), by = group]
This splits the data.table by the character after the underscore (by = group); then, for every column of the rectangular subset (.SD), we get the index of the maximum within the sub-rectangle (which.max), map it back to a row number in the whole data.table (.I), and extract the relevant subgroup label (subgroup).
The data.table package is meant to be quite efficient, though you might want to look into indexing your data.table if you're going to be querying it multiple times.
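If you run this on the question's example data, result should come out as:
result
#    group ID1 ID2 ID3 ID4
# 1:     X  x3  x2  x2  x2
# 2:     Y  y2  y2  y1  y2
# 3:     Z  z2  z2  z2  z2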
Your table:
df <- read.table(text = "     ID1 ID2 ID3 ID4
x1_X 0 10 4 7
x2_X 2 12 5 8
x3_X 3 1 3 5
y1_Y 4 13 6 4
y2_Y 5 14 1 9
y3_Y 2 11 1 5
y4_Y 1 1 2 3
z1_Z 1 0 0 5
z2_Z 3 6 7 7", header = TRUE)
Split rownames to get groups:
library(plyr)
df_names <- ldply(strsplit(rownames(df), "_"))
colnames(df_names) <- c("group1", "group2")
df2 <- cbind(df, df_names)
Create new table:
df_new <- data.frame(matrix(nrow = length(unique(df2$group2)),
                            ncol = ncol(df)))
colnames(df_new) <- colnames(df)
rownames(df_new) <- unique(df_names[["group2"]])
Filling new table with a loop:
for (i in 1:ncol(df_new)) {
  for (k in 1:nrow(df_new)) {
    col0 <- colnames(df_new)[i]
    row0 <- rownames(df_new)[k]
    sub0 <- df2[df2$group2 == row0, c(col0, "group1")]
    df_new[k, i] <- sub0[sub0[1] == max(sub0[1]), 2]
  }
}
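For reference, a vectorized base-R sketch of the same fill step, reusing the df2 built above (which.max again picks the row with the largest value per column within each group; the result is a character matrix rather than a data frame):
# Pick out the ID columns, split by group, and take the subgroup label
# of the maximum value in each column of each group
id_cols <- grep("^ID", names(df2), value = TRUE)
res <- do.call(rbind, lapply(split(df2, df2$group2), function(s) {
  sapply(s[id_cols], function(col) s$group1[which.max(col)])
}))
res
#   ID1  ID2  ID3  ID4
# X "x3" "x2" "x2" "x2"
# Y "y2" "y2" "y1" "y2"
# Z "z2" "z2" "z2" "z2"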
