I'm working with a data frame that has a timestamp, three numeric vectors, and a character vector.
Basically, each "set" of data is delineated by a new row. I need each series of rows to save as .csv when the row reads that each column is empty (x = \t\r\n). There's about 370 in my dataset.
For example,
library(dplyr)
data <- data.frame(x1 = 1:4,
                   x2 = 4:1,
                   x3 = 3,
                   x4 = c("text", "no text", "example", "hello"))
new_row <- c("\t\r\n", "\t\r\n", "\t\r\n", "\t\r\n")
data1 <- rbind(data, new_row)
data2 <- data.frame(x1 = 1:4,
                    x2 = 4:1,
                    x3 = 4,
                    x4 = c("text", "no text", "example", "hello"))
data2 <- rbind(data2, new_row)
data3 <- rbind(data1, data2)
View(data3)
This is what my data set looks like (without the timestamp). I need every set of consecutive rows before a row full of \t\r\n to be exported as an individual .csv.
I'm doing text analysis. Each group of rows, with highly variable group size, represents a thread on a different subject. I need to analyze these individual threads.
What is the best way to go about doing this? I haven't had this problem before.
ind <- grepl("\t", data3$x4)          # TRUE on the separator rows
ind <- replace(cumsum(ind), ind, -1)  # running group counter; separator rows forced to -1
ind
# [1] 0 0 0 0 -1 1 1 1 1 -1
data4 <- split(data3, ind)
data4
# $`-1`
# x1 x2 x3 x4
# 5 \t\r\n \t\r\n \t\r\n \t\r\n
# 10 \t\r\n \t\r\n \t\r\n \t\r\n
# $`0`
# x1 x2 x3 x4
# 1 1 4 3 text
# 2 2 3 3 no text
# 3 3 2 3 example
# 4 4 1 3 hello
# $`1`
# x1 x2 x3 x4
# 6 1 4 4 text
# 7 2 3 4 no text
# 8 3 2 4 example
# 9 4 1 4 hello
The use of -1 was solely to keep the "\t\r\n" rows from being included in their respective groups, and we know that cumsum(ind) starts at 0, so the real groups are numbered from 0 upward. You can obviously drop the first frame (the -1 group of separator rows) :-)
From here, you can export with
data4 <- data4[-1]
ign <- Map(write.csv, data4, sprintf("file_%03d.csv", seq_along(data4)))
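If you don't want row names written into the files, a small variation (a sketch) passes row.names = FALSE through to write.csv:
ign <- Map(function(d, f) write.csv(d, f, row.names = FALSE),
           data4, sprintf("file_%03d.csv", seq_along(data4)))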
I want to split a dataframe by the changing values in the first column and afterwards attach the split parts as new columns. An example is given below. However, I end up with a list that I can't process back into a handy dataframe.
The desired output should look like df_goal, which is not yet properly formatted.
# data
x <- c(1, 2, 3)
y <- c(20200101, 20200101, 20200101)
z <- c(4.5, 5, 7)
x_name <- "ID"
y_name <- "Date"
z_name <- "value"
df <- data.frame(x, y, z)
names(df) <- c(x_name, y_name, z_name)
# processing
df$Date <- format(as.Date(as.character(df$Date), format = "%Y%m%d"))
df01 <- split(df, f = df$ID)
# goal
a <- c(1)
b <- c(20200101)
c <- c(4.5)
d <- c(2)
e <- c(20200101)
f <- c(5)
g <- c(3)
h <- c(20200101)
i <- c(7)
df_goal <- data.frame(a, b, c, d, e, f, g, h, i)
You can use Reduce and cbind to bind each row of the data.frame into one wide row while keeping the column types.
Reduce(function(x,y) cbind(x, df[y,]), 2:nrow(df), df[1,])
# ID Date value ID Date value ID Date value
#1 1 20200101 4.5 2 20200101 5 3 20200101 7
#Equivalent for the sample dataset: cbind(cbind(df[1,], df[2,]), df[3,])
or do.call with split:
do.call(cbind, split(df, 1:nrow(df)))
# 1.ID 1.Date 1.value 2.ID 2.Date 2.value 3.ID 3.Date 3.value
#1 1 20200101 4.5 2 20200101 5 3 20200101 7
#Equivalent for the sample dataset: cbind(df[1,], df[2,], df[3,])
In case you have several rows per ID you can try:
x <- split(df, df$ID)
y <- max(unlist(lapply(x, nrow)))
do.call(cbind, lapply(x, function(i) i[1:y,]))
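For instance, with a hypothetical df_uneq that has two rows for ID 1 and one row for ID 2, the shorter group gets padded with NA rows:
df_uneq <- data.frame(ID = c(1, 1, 2), Date = 20200101, value = c(4.5, 5, 7))
x <- split(df_uneq, df_uneq$ID)
y <- max(unlist(lapply(x, nrow)))                # longest group has 2 rows
do.call(cbind, lapply(x, function(i) i[1:y, ]))  # the ID 2 block gains an NA row,
                                                 # since i[1:y, ] returns NA rows
                                                 # for indices beyond nrow(i)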
This is a possible solution for your example:
new_df = data.frame(list(df[1,],df[2,],df[3,]))
And if you want to generalize that to a bigger data.frame:
new_list <- list()
for (i in 1:nrow(df)) {
  new_list[[i]] <- df[i, ]
}
new_df <- data.frame(new_list)
One option could be:
setNames(Reduce(c, asplit(df, 1)), letters[1:Reduce(`*`, dim(df))])
a b c d e f g h i
1.0 20200101.0 4.5 2.0 20200101.0 5.0 3.0 20200101.0 7.0
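Note that this returns a named numeric vector rather than a data frame; if you need a one-row data frame, you could wrap it (a sketch, storing the vector in a hypothetical v):
v <- setNames(Reduce(c, asplit(df, 1)), letters[1:Reduce(`*`, dim(df))])
as.data.frame(t(v))  # one-row data frame with columns a..i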
Maybe you can try the following code
df_goal <- data.frame(t(c(t(df))))
such that
> df_goal
X1 X2 X3 X4 X5 X6 X7 X8 X9
1 1 20200101 4.5 2 20200101 5 3 20200101 7
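If you prefer the a…i names used for df_goal over the automatic X1…X9, you could rename afterwards:
names(df_goal) <- letters[1:ncol(df_goal)]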
I have a dataframe with a differing number of names in each cell, which I want to replace with the corresponding numbers from another dataframe. Afterwards, I want to proceed and calculate the mean and maximum, but that's not part of my problem.
df_with_names <-read.table(text="
id names
1 AA,BB
2 AA,CC,DD
3 BB,CC
4 AA,BB,CC,DD
",header=TRUE,sep="")
The dataframe with the corresponding numbers looks like
df_names <-read.table(text="
name number_1 number_2
AA 20 30
BB 12 14
CC 13 29
DD 14 27
",header=TRUE,sep="")
At the end of the first step it should be
id number_1 number_2
1 20,12 30,14
2 20,13,14 30,29,27
3 12,13 14,29
4 20,12,13,14 30,14,29,27
From here I know how to proceed but I don't know how to get there.
I tried to separate the names of each row into a dataframe in a loop and then replace the names, but I always fail to get the right column of df_with_names. After a while, I started to doubt that replace() is the function I am looking for. Who can help?
library(data.table)
dt1 = as.data.table(df_with_names)
dt2 = as.data.table(df_names)
setkey(dt2, name)
dt2[setkey(dt1[, strsplit(as.character(names), split = ","), by = id], V1)][,
lapply(.SD, paste0, collapse = ","), keyby = id]
# id name number_1 number_2
#1: 1 AA,BB 20,12 30,14
#2: 2 AA,CC,DD 20,13,14 30,29,27
#3: 3 BB,CC 12,13 14,29
#4: 4 AA,BB,CC,DD 20,12,13,14 30,14,29,27
The above first splits the names along the comma in the first data.table, then joins that with the second one (after setting keys appropriately) and collapses all of the resulting columns back with a comma.
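For reference, you can inspect the intermediate long table from the strsplit step on its own (a sketch, reusing dt1 from above):
# one row per (id, name) pair, with the split names in column V1
dt1[, strsplit(as.character(names), split = ","), by = id]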
Another all-in-one approach:
data2match <- strsplit(df_with_names$names, ',')
lookup <- function(lookfor, in_df, return_col, search_col = 1) {
  in_df[, return_col][match(lookfor, in_df[, search_col])]
}
output <-
  # for each number_x column....
  sapply(names(df_names)[-1],
         function(y) {
           # for each set of names
           sapply(data2match,
                  function(x) paste(sapply(x, lookup, df_names, y,
                                           USE.NAMES = FALSE),
                                    collapse = ","))
         })
data.frame(id=1:nrow(output), output)
Produces:
id number_1 number_2
1 1 20,12 30,14
2 2 20,13,14 30,29,27
3 3 12,13 14,29
4 4 20,12,13,14 30,14,29,27
Note: make sure both dataframes are ordered by id, otherwise you may see unexpected results.
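For example, a minimal sketch of pre-sorting df_with_names by id before building the output:
df_with_names <- df_with_names[order(df_with_names$id), ]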
listing <- df_with_names
listing <- strsplit(as.character(listing$names),",")
col1 <- lapply(listing, function(x) df_names[(df_names[[1]] %in% x),2])
col2 <- lapply(listing, function(x) df_names[(df_names[[1]] %in% x),3])
col1 <- unlist(lapply(col1, paste0, collapse = ","))
col2 <- unlist(lapply(col2, paste0, collapse = ","))
data.frame(number_1 = col1, number_2 = col2 )
number_1 number_2
1 20,12 30,14
2 20,13,14 30,29,27
3 12,13 14,29
4 20,12,13,14 30,14,29,27
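If you also want to keep the id column shown in the question (assuming the row order of df_with_names is unchanged), you can bind it back in:
data.frame(id = df_with_names$id, number_1 = col1, number_2 = col2)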
I don't like names like "names" or "name", so I went with "nam", renaming the lookup column first so the code below can use df_names$nam:
names(df_names)[1] <- "nam"
do.call(rbind,                      # reassembles the individual lists
  apply(df_with_names, 1,           # for each row in df_with_names
    function(x) lapply(             # lapply(..., paste) to each column
      # Next line reads each comma-separated value, matches it to rows of
      # df_names[] and returns cols 2:3
      df_names[df_names$nam %in% scan(text = x[2], what = "", sep = ","),
               2:3, drop = FALSE],  # construct packet of text digits
      paste0, collapse = ",")))
number_1 number_2
[1,] "20,12" "30,14"
[2,] "20,13,14" "30,29,27"
[3,] "12,13" "14,29"
[4,] "20,12,13,14" "30,14,29,27"
(I'm surprised that scan(text = ...) on a factor variable actually succeeded.)
Another method:
# (using df1 = df_with_names and df2 = df_names from the question)
df3 <- data.frame(id = df1$id,
                  number_1 = as.character(df1$names),
                  number_2 = as.character(df1$names), stringsAsFactors = FALSE)
for (n1 in 1:nrow(df3))
  for (n2 in 1:nrow(df2)) {
    df3[n1, 2] <- sub(df2[n2, 1], df2[n2, 2], df3[n1, 2])
    df3[n1, 3] <- sub(df2[n2, 1], df2[n2, 3], df3[n1, 3])
  }
df3
# id number_1 number_2
#1 1 20,12 30,14
#2 2 20,13,14 30,29,27
#3 3 12,13 14,29
#4 4 20,12,13,14 30,14,29,27
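One caveat: sub() interprets df2[n2, 1] as a regular expression; with plain names such as AA that is harmless, but for arbitrary names you may want literal matching via fixed = TRUE, e.g.:
df3[n1, 2] <- sub(df2[n2, 1], df2[n2, 2], df3[n1, 2], fixed = TRUE)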
I think it would actually be worth your while to rearrange your df_with_names dataset to make things more straightforward:
spl <- strsplit(as.character(df_with_names$names), ",")
df_with_names <- data.frame(
  id = rep(df_with_names$id, sapply(spl, length)),
  name = unlist(spl)
)
# id name
#1 1 AA
#2 1 BB
#3 2 AA
#4 2 CC
#5 2 DD
#6 3 BB
#7 3 CC
#8 4 AA
#9 4 BB
#10 4 CC
#11 4 DD
aggregate(
  . ~ id,
  data = merge(df_with_names, df_names, by = "name")[-1],
  FUN = function(x) paste(x, collapse = ",")
)
# id number_1 number_2
#1 1 20,12 30,14
#2 2 20,13,14 30,29,27
#3 3 12,13 14,29
#4 4 20,12,13,14 30,14,29,27
I have the following data frame and I would like to create a new one that will be like the one below.
ID1 ID2 ID3 ID4
x1_X 0 10 4 7
x2_X 2 12 5 8
x3_X 3 1 3 5
y1_Y 4 13 6 4
y2_Y 5 14 1 9
y3_Y 2 11 1 5
y4_Y 1 1 2 3
z1_Z 1 0 0 5
z2_Z 3 6 7 7
New data frame
ID1 ID2 ID3 ID4
X x3 x2 x2 x2
Y y2 y2 y1 y2
Z z2 z2 z2 z2
Basically the idea is the following:
For each ID column I want to find which of the row names (x1_X, x2_X, x3_X) has the most extreme value and assign that subgroup to the name X, since the row names encode subgroups.
My data frame is huge: 1700 columns and 100000 rows.
First we need to split the group and subgroup labels:
grp <- strsplit(row.names(df), "_")
And if performance is an issue, I think data.table is our best choice:
library(data.table)
df$group <- sapply(grp, "[", 2)
subgroup <- sapply(grp, "[", 1)
dt <- data.table(df)
And we now have access to the single line:
result <- dt[,lapply(.SD, function(x) subgroup[.I[which.max(x)]]), by=group]
Which splits the data.table by the character after the underscore (by=group) and then, for every column of the rectangular subset (.SD) we get the index in the sub-rectangle (which.max), and then map it back to the whole data.table (.I), and then extract the relevant subgroup (subgroup).
The data.table package is meant to be quite efficient, though you might want to look into indexing your data.table if you're going to be querying it multiple times.
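For example, a minimal sketch of keying or indexing the table from above on the grouping column:
setkey(dt, group)    # physically reorders dt by group and marks it as the key
# or, to add a secondary index without reordering:
setindex(dt, group)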
Your table:
df <- read.table (text= " ID1 ID2 ID3 ID4
x1_X 0 10 4 7
x2_X 2 12 5 8
x3_X 3 1 3 5
y1_Y 4 13 6 4
y2_Y 5 14 1 9
y3_Y 2 11 1 5
y4_Y 1 1 2 3
z1_Z 1 0 0 5
z2_Z 3 6 7 7", header = T)
Split rownames to get groups:
library(plyr)
df_names <- ldply(strsplit (rownames(df), "_"))
colnames(df_names) <- c ("group1", "group2")
df2 <- cbind (df, df_names)
Create new table:
df_new <- data.frame(matrix(nrow = length(unique(df2$group2)),
                            ncol = ncol(df)))
colnames(df_new) <- colnames(df)
rownames(df_new) <- unique(df_names[["group2"]])
Filling new table with a loop:
for (i in 1:ncol(df_new)) {
  for (k in 1:nrow(df_new)) {
    col0 <- colnames(df_new)[i]
    row0 <- rownames(df_new)[k]
    # rows of this group (e.g. X) with the current ID column and the subgroup labels
    sub0 <- df2[df2$group2 == row0, c(col0, "group1")]
    # keep the subgroup whose value is the maximum in this column
    df_new[k, i] <- sub0[sub0[1] == max(sub0[1]), 2]
  }
}
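For the sample data this should reproduce the table requested in the question:
df_new
#   ID1 ID2 ID3 ID4
# X  x3  x2  x2  x2
# Y  y2  y2  y1  y2
# Z  z2  z2  z2  z2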