R code - split column each specific row values [duplicate]

How can I split the following data.frame
df <- data.frame(var1 = c("a", 1, 2, 3, "a", 1, 2, 3, 4, 5, 6, "a", 1, 2), var2 = 1:14)
into a list of groups like these:
a 1
1 2
2 3
3 4
a 5
1 6
2 7
3 8
4 9
5 10
6 11
a 12
1 13
2 14
So basically, the value "a" in column 1 is the tag / identifier I want to split the data frame on. I know about the split function, but that means I have to add another column. Since the size of the groups can vary, as my example shows, I do not know how to create such a dummy column automatically.
Any ideas on that?
Cheers,
Sven

You could find which values of the indexing vector equal "a", then create a grouping variable based on that and then use split.
df[,1] == "a"
# [1] TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
#[13] FALSE FALSE
cumsum(df[,1] == "a")
# [1] 1 1 1 1 2 2 2 2 2 2 2 3 3 3
split(df, cumsum(df[,1] == "a"))
#$`1`
# var1 var2
#1 a 1
#2 1 2
#3 2 3
#4 3 4
#
#$`2`
# var1 var2
#5 a 5
#6 1 6
#7 2 7
#8 3 8
#9 4 9
#10 5 10
#11 6 11
#
#$`3`
# var1 var2
#12 a 12
#13 1 13
#14 2 14

You could loop over the first column of the data frame and save the positions of the non-numeric values in a vector. Something like:
data <- as.character(df$var1) # the values to scan; as.character() also covers factor columns
positions <- c()
for (i in seq_along(data)) {
  # is.numeric() is always FALSE for elements of a character vector,
  # so test instead whether the value fails to parse as a number
  if (is.na(suppressWarnings(as.numeric(data[i])))) {
    positions <- append(positions, i) # saves the positions of the non-numeric values
  }
}
With those positions, splitting up the data frame is straightforward: each group runs from one position up to the row just before the next one.
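The last step of that answer, turning the marker positions into groups, could look roughly like the sketch below. It assumes the first row is itself a marker row, and the names vals, ends and groups are made up for illustration.

```r
df <- data.frame(var1 = c("a", 1, 2, 3, "a", 1, 2, 3, 4, 5, 6, "a", 1, 2), var2 = 1:14)

# Positions of the non-numeric marker rows.
vals <- as.character(df$var1)
positions <- which(is.na(suppressWarnings(as.numeric(vals))))

# Each group runs from one marker up to the row before the next marker;
# the last group ends at the final row of the data frame.
ends <- c(positions[-1] - 1, nrow(df))
groups <- Map(function(from, to) df[from:to, ], positions, ends)

sapply(groups, nrow)
# [1] 4 7 3
```

For this data the result should contain the same rows as split(df, cumsum(df[,1] == "a")), which is shorter and avoids the explicit loop.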

Related

How can I insert blank rows every 3 existing rows in a data frame?

After a web scraping process I get a dataframe with the information I need, however the final excel format requires that I add a blank row every 3 rows. I have searched the web for help but have not found a solution yet.
With hypothetical data, the structure of my data frame is as follows:
mi_df <- data.frame(
"ID" = rep(1:3,c(3,3,3)),
"X" = as.character(c("a", "a", "a", "b", "b", "b", "c", "c", "c")),
"Y" = seq(1,18, by=2)
)
mi_df
ID X Y
1 1 a 1
2 1 a 3
3 1 a 5
4 2 b 7
5 2 b 9
6 2 b 11
7 3 c 13
8 3 c 15
9 3 c 17
The result I hope for is something like this
ID X Y
1 1 a 1
2 1 a 3
3 1 a 5
4
5 2 b 7
6 2 b 9
7 2 b 11
8
9 3 c 13
10 3 c 15
11 3 c 17
If the indices of a data frame contain NA, then the output will have NA rows. So my goal is to create a vector like 1 2 3 NA 4 5 6 NA ... and set it as the indices of mi_df.
cut <- rep(1:(nrow(mi_df)/3), each = 3)
mi_df[sapply(split(1:nrow(mi_df), cut), c, NA), ]
# ID X Y
# 1 1 a 1
# 2 1 a 3
# 3 1 a 5
# NA NA <NA> NA
# 4 2 b 7
# 5 2 b 9
# 6 2 b 11
# NA.1 NA <NA> NA
# 7 3 c 13
# 8 3 c 15
# 9 3 c 17
# NA.2 NA <NA> NA
If nrow(mi_df) is not a multiple of 3, then the following is a general solution:
# Version 1
cut <- rep(1:ceiling(nrow(mi_df)/3), each = 3, len = nrow(mi_df))
mi_df[Reduce(c, lapply(split(1:nrow(mi_df), cut), c, NA)), ]
# Version 2
cut <- rep(1:ceiling(nrow(mi_df)/3), each = 3, len = nrow(mi_df))
mi_df[Reduce(function(x, y) c(x, NA, y), split(1:nrow(mi_df), cut)), ]
Don't mind the NA in the output: some functions that write data to an Excel file have an optional argument that controls whether NA values are written as strings or left as empty cells. E.g.
library(openxlsx)
write.xlsx(df, "test.xlsx", keepNA = FALSE) # defaults to FALSE
tmp <- split(mi_df, rep(1:(nrow(mi_df) / 3), each = 3))
# or split(mi_df, ggplot2::cut_width(seq_len(nrow(mi_df)), 3, center = 2))
do.call(rbind, lapply(tmp, function(x) { x[4, ] <- NA; x }))
ID X Y
1.1 1 a 1
1.2 1 a 3
1.3 1 a 5
1.4 NA <NA> NA
2.4 2 b 7
2.5 2 b 9
2.6 2 b 11
2.4.1 NA <NA> NA
3.7 3 c 13
3.8 3 c 15
3.9 3 c 17
3.4 NA <NA> NA
You can make empty rows like you show by assigning an empty character vector ("") instead of NA, but this will convert your columns to character, and I wouldn't recommend it.
My recommendation is somewhat different from all the other answers: don't make a mess of your dataset inside R. Use an existing package to write to designated rows in an Excel workbook. For example, in the XLConnect package, the method writeWorksheet (called from writeWorksheetToFile) includes these arguments:
object    The workbook to write to
data      Data to write
sheet     The name or index of the sheet to write to
startRow  Index of the first row to write to. The default is startRow = 1.
startCol  Index of the first column to write to. The default is startCol = 1.
So if you simply set up a loop that writes 3 rows of your data file at a time, then moves the row index down by 4 and writes the next 3 rows, etc., you're all set.
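As a rough sketch of that loop, the base R code below only works out which sheet row each block of 3 rows would start on. The file name, sheet name and the choice to leave row 1 free for a header are assumptions for illustration, and the XLConnect call is left as an untested comment rather than run.

```r
mi_df <- data.frame(
  ID = rep(1:3, each = 3),
  X  = rep(c("a", "b", "c"), each = 3),
  Y  = seq(1, 18, by = 2)
)

chunk_size <- 3
# Consecutive blocks of chunk_size rows each.
chunks <- split(mi_df, ceiling(seq_len(nrow(mi_df)) / chunk_size))

# Row 1 is left free for a header, so block k starts at sheet row
# 2 + (k - 1) * (chunk_size + 1); the extra 1 is the blank separator row.
start_rows <- 2 + (seq_along(chunks) - 1) * (chunk_size + 1)
start_rows
# [1]  2  6 10

# Hypothetical write step (assumes XLConnect is installed; not run here):
# for (k in seq_along(chunks)) {
#   XLConnect::writeWorksheetToFile("out.xlsx", data = chunks[[k]], sheet = "Sheet1",
#                                   startRow = start_rows[k], header = FALSE)
# }
```

Keeping the layout arithmetic separate from the write step makes it easy to check the row positions before touching the workbook.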
Here's one method.
Splits into list by ID, adds empty row, then binds list back into data frame.
mi_df2 <- do.call(rbind,Map(rbind,split(mi_df,mi_df$ID),rep("",3)))
rownames(mi_df2) <- NULL

Subsetting values not adding up in R [closed]

I have a dataframe (df) in R. All columns are character class.
> dim(df)
[1] 1000 6
I'm trying to remove rows where df$entry == c("7795").
entries_to_remove <- subset(df, entry == c("7795"))
> dim(entries_to_remove)
[1] 35 6
So as you can see above, I have 35 entries to remove from the data frame. However, when I go to remove these using subset, it doesn't remove the correct amount:
entries_to_remove <- subset(df, entry != c("7795"))
> dim(entries_to_remove)
[1] 648 6
The above command was supposed to remove 35 entries, but instead it removed 352. Does anyone know why this might be happening?
Here's another solution, which takes up just one line:
df[-which(grepl("7995", apply(df, 1, paste0, collapse = " "))),]
RESULT:
v1 entry1 entry2 entry3
2 2 5 5 2
3 3 2 4 2
4 4 2 3 1
6 6 1 2 1
7 7 2 4 4
8 8 4 5 5
9 9 5 1 5
DATA:
set.seed(121)
df <- data.frame(
v1 = 1:10,
entry1 = c(sample(1:5, 9, replace = T), 7995),
entry2 = c(sample(1:5, 4), 7995, sample(1:5, 5)),
entry3 = c(7995, sample(1:5, 9, replace = T))
)
df[2:4] <- lapply(df[2:4], as.character) # convert to character, as in your data
df
v1 entry1 entry2 entry3
1 1 1 2 7995
2 2 5 5 2
3 3 2 4 2
4 4 2 3 1
5 5 3 7995 2
6 6 1 2 1
7 7 2 4 4
8 8 4 5 5
9 9 5 1 5
10 10 7995 3 5
The above solutions didn't work, and I do not think the issue is with NA. However, I solved the problem myself. It is a workaround, but it worked:
# list the row numbers for the entries to remove
row_remove <- rownames(entries_to_remove )
# make a list of all the row numbers
all_rows <- 1:dim(df)[1]
# create a vector with only the rows to keep
subset_row <- all_rows[!(all_rows%in%row_remove)]
# subset the dataframe with these rows
df<- df[subset_row,]
The issue has to do with NAs. Some of the other solutions will work, but the easiest and, I think, most intuitive fix is just to use %in% rather than ==
entries_to_remove <- subset(df, !(entry %in% c("7795")))
entries_to_remove <- subset(df, entry %in% c("7795"))
This should explain what's happening. Notice how == returns NA rather than FALSE.
> c( 5, 6, 7) == 5
[1] TRUE FALSE FALSE
> c( 5, 6, 7 , NA) == 5
[1] TRUE FALSE FALSE NA
> c( 5, 6, 7 , NA) %in% 5
[1] TRUE FALSE FALSE FALSE
and subset() drops every row where the condition evaluates to NA, so rows with a missing entry are removed along with the matches.
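A small self-contained illustration of the effect, using made-up data with two missing values in the entry column (the data frame here is hypothetical, not the asker's):

```r
df <- data.frame(
  entry = c("7795", "12", NA, "7795", NA, "34"),
  stringsAsFactors = FALSE
)

nrow(subset(df, entry == "7795"))      # 2 rows match
nrow(subset(df, entry != "7795"))      # 2 rows kept: the NA rows are dropped too
nrow(subset(df, !(entry %in% "7795"))) # 4 rows kept: the NA rows are retained
sum(is.na(df$entry))                   # 2 missing values account for the gap
```

If the same thing is happening in the original data, sum(is.na(df$entry)) would be expected to equal the number of rows that went missing unexpectedly.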

Count changes to contents of a character vector [duplicate]

I have a data_frame where a character variable x changes in time. I want to count the number of times it changes, and fill a new vector with this count.
df <- data_frame(
x = c("a", "a", "b", "b", "c", "b"),
wanted = c(1, 1, 2, 2, 3, 4)
)
x wanted
1 a 1
2 a 1
3 b 2
4 b 2
5 c 3
6 b 4
This is similar to, but different from rle(df$x), which would return
Run Length Encoding
lengths: int [1:4] 2 2 1 1
values : chr [1:4] "a" "b" "c" "b"
I could try to rep() that output. I have also tried this, which is awfully close but not quite right, for reasons I can't figure out immediately:
df %>% mutate(
try_1 = cumsum(ifelse(x == lead(x) | is.na(lead(x)), 1, 0))
)
Source: local data frame [6 x 3]
x wanted try_1
1 a 1 1
2 a 1 1
3 b 2 2
4 b 2 2
5 c 3 2
6 b 4 3
It seems like there should be a function that does this directly, that I just haven't found in my experience.
Try this dplyr code:
df %>%
mutate(try_1 = cumsum(ifelse(x != lag(x) | is.na(lag(x)), 1, 0)))
x wanted try_1
1 a 1 1
2 a 1 1
3 b 2 2
4 b 2 2
5 c 3 3
6 b 4 4
Yours was saying: increment the count if a value is the same as the following row's value, or if the following row's value is NA.
This says: increment the count if the variable on this row either is different than the one on the previous row, or if there wasn't one on the previous row (e.g., row 1).
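The same "compare with the previous value" idea can be sketched in base R without dplyr. This is an illustrative variant, not part of the answer above; it assumes x has at least one element and no NA values, and the names changed and run_id are made up.

```r
x <- c("a", "a", "b", "b", "c", "b")

# TRUE for the first element and wherever a value differs from its predecessor.
changed <- c(TRUE, x[-1] != x[-length(x)])

# The running total of the change flags numbers each contiguous run.
run_id <- cumsum(changed)
run_id
# [1] 1 1 2 2 3 4
```

Forcing the first flag to TRUE plays the same role as the is.na(lag(x)) check: the first row has no predecessor, so it always starts a new run.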
You can try
library(data.table) #data.table_1.9.5
setDT(df)[, wanted := rleid(x)][]
# x wanted
#1: a 1
#2: a 1
#3: b 2
#4: b 2
#5: c 3
#6: b 4
Or a base R option would be
inverse.rle(within.list(rle(as.character(df$x)),
values<- seq_along(values)))
#[1] 1 1 2 2 3 4
data
df <- data.frame(x=c("a", "a", "b", "b", "c", "b"))
