How can I insert blank rows every 3 existing rows in a data frame?

After a web scraping process I get a data frame with the information I need; however, the final Excel format requires that I add a blank row every 3 rows. I have searched the web for help but have not found a solution yet.
With hypothetical data, the structure of my data frame is as follows:
mi_df <- data.frame(
  "ID" = rep(1:3, c(3, 3, 3)),
  "X" = as.character(c("a", "a", "a", "b", "b", "b", "c", "c", "c")),
  "Y" = seq(1, 18, by = 2)
)
mi_df
ID X Y
1 1 a 1
2 1 a 3
3 1 a 5
4 2 b 7
5 2 b 9
6 2 b 11
7 3 c 13
8 3 c 15
9 3 c 17
The result I hope for is something like this
ID X Y
1 1 a 1
2 1 a 3
3 1 a 5
4
5 2 b 7
6 2 b 9
7 2 b 11
8
9 3 c 13
10 3 c 15
11 3 c 17

If the row indices used to subset a data frame contain NA, the corresponding rows of the output are NA rows. So the goal is to create an index vector like 1 2 3 NA 4 5 6 NA ... and subset mi_df with it.
cut <- rep(1:(nrow(mi_df)/3), each = 3)
mi_df[sapply(split(1:nrow(mi_df), cut), c, NA), ]
# ID X Y
# 1 1 a 1
# 2 1 a 3
# 3 1 a 5
# NA NA <NA> NA
# 4 2 b 7
# 5 2 b 9
# 6 2 b 11
# NA.1 NA <NA> NA
# 7 3 c 13
# 8 3 c 15
# 9 3 c 17
# NA.2 NA <NA> NA
If nrow(mi_df) is not a multiple of 3, then the following is a general solution:
# Version 1
cut <- rep(1:ceiling(nrow(mi_df)/3), each = 3, len = nrow(mi_df))
mi_df[Reduce(c, lapply(split(1:nrow(mi_df), cut), c, NA)), ]
# Version 2
cut <- rep(1:ceiling(nrow(mi_df)/3), each = 3, len = nrow(mi_df))
mi_df[Reduce(function(x, y) c(x, NA, y), split(1:nrow(mi_df), cut)), ]
Don't mind the NAs in the output, because some functions that write data to an Excel file have an optional argument controlling whether NA values are written as the string "NA" or left as empty cells. E.g. with openxlsx:
library(openxlsx)
out <- mi_df[Reduce(function(x, y) c(x, NA, y), split(1:nrow(mi_df), cut)), ]
write.xlsx(out, "test.xlsx", keepNA = FALSE)  # keepNA already defaults to FALSE

# split into blocks of 3 rows
tmp <- split(mi_df, rep(1:(nrow(mi_df) / 3), each = 3))
# or split(mi_df, ggplot2::cut_width(seq_len(nrow(mi_df)), 3, center = 2))
# append an NA row to each block, then bind the blocks back together
do.call(rbind, lapply(tmp, function(x) { x[4, ] <- NA; x }))
ID X Y
1.1 1 a 1
1.2 1 a 3
1.3 1 a 5
1.4 NA <NA> NA
2.4 2 b 7
2.5 2 b 9
2.6 2 b 11
2.4.1 NA <NA> NA
3.7 3 c 13
3.8 3 c 15
3.9 3 c 17
3.4 NA <NA> NA
You can make empty rows like the ones you show by assigning an empty string ("") instead of NA, but this will convert your columns to character, so I wouldn't recommend it.
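For illustration, a minimal sketch of what that assignment does to the column types (assuming R 4.x, where data.frame() keeps strings as character; mi_df_chr is just a throwaway name):
tmp <- split(mi_df, rep(1:(nrow(mi_df) / 3), each = 3))
mi_df_chr <- do.call(rbind, lapply(tmp, function(x) { x[4, ] <- ""; x }))
str(mi_df_chr)  # ID and Y are now character, no longer numeric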

My recommendation is somewhat different from all the other answers: don't make a mess of your dataset inside R. Use the existing packages to write to designated rows in an Excel workbook. For example, with the package XLConnect, the method writeWorksheet (called from writeWorksheetToFile) includes these arguments:
object    The workbook to write to
data      Data to write
sheet     The name or index of the sheet to write to
startRow  Index of the first row to write to. The default is startRow = 1.
startCol  Index of the first column to write to. The default is startCol = 1.
So if you simply set up a loop that writes 3 rows of your data file at a time, then moves the row index down by 4 and writes the next 3 rows, etc., you're all set.
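A minimal sketch of that loop, assuming XLConnect is installed; the file name "mi_df.xlsx" and sheet name "datos" are made up for illustration, and column headers are skipped for simplicity:
library(XLConnect)
wb <- loadWorkbook("mi_df.xlsx", create = TRUE)   # hypothetical output file
createSheet(wb, name = "datos")                   # hypothetical sheet name
n_blocks <- ceiling(nrow(mi_df) / 3)
for (i in seq_len(n_blocks)) {
  rows <- ((i - 1) * 3 + 1):min(i * 3, nrow(mi_df))
  # each block of up to 3 rows starts 4 sheet rows below the previous one,
  # which leaves one blank row between blocks
  writeWorksheet(wb, mi_df[rows, ], sheet = "datos",
                 startRow = (i - 1) * 4 + 1, header = FALSE)
}
saveWorkbook(wb)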

Here's one method. It splits the data frame into a list by ID, adds an empty row to each piece, then binds the list back into a data frame.
# note: rbind-ing "" onto each piece coerces every column to character
mi_df2 <- do.call(rbind, Map(rbind, split(mi_df, mi_df$ID), rep("", 3)))
rownames(mi_df2) <- NULL

Related

I want to create a new CSV from an existing CSV that consists of multiple repeated columns with unsorted data

I have a CSV with these data:
List Rank.A List Rank.B List Rank.C
a 4 a 8 b 3
b 5 e 5 e 9
c 7 f 5 r 1
I want to create a new CSV in which there is a single column named List with unique values, plus the three columns "Rank.A", "Rank.B", "Rank.C" matched to that list. If a rank is not listed for a given List value, it should be left blank. I want the data in this format:
List Rank.A Rank.B Rank.C
a 4 8
b 5 3
c 7
e 5 9
f 5
r 1
Can you please help me in that?
A base R option using split.default (to split your data.frame by columns) and Reduce + merge to combine data into a single data.frame.
Reduce(
  function(x, y) merge(x, y, all = TRUE),
  split.default(df, rep(1:(ncol(df) / 2), each = 2))
)
# List Rank.A Rank.B Rank.C
# 1 a 4 8 NA
# 2 b 5 NA 3
# 3 c 7 NA NA
# 4 e NA 5 9
# 5 f NA 5 NA
# 6 r NA NA 1
Note that this assumes that you always have pairs of columns (List, Rank.x) in your original data.
Sample data
df <- read.table(text =
"List Rank.A List Rank.B List Rank.C
a 4 a 8 b 3
b 5 e 5 e 9
c 7 f 5 r 1", header = T, check.names = F)
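To see the intermediate pieces that Reduce() merges, you can inspect the split directly; each element is a two-column data frame sharing the List key column:
pieces <- split.default(df, rep(1:(ncol(df) / 2), each = 2))
lapply(pieces, names)
# each element has the names "List" plus one of "Rank.A", "Rank.B", "Rank.C"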

Using mapply to set values based on values in other columns

Based on my previous question, I need help with using the mapply function correctly.
x <- data.frame(a = seq(1,3), b = seq(2,4), c = seq(3,5), d = seq(4,6), b2 = seq(5,7), c2 = seq(6,8), d2 = seq(7,9))
# a b c d b2 c2 d2
# 1 2 3 4 5 6 7
# 2 3 4 5 6 7 8
# 3 4 5 6 7 8 9
My goal is to look at the columns b2 to d2 and, based on their values, change the values in columns b to d respectively. I can do this for a single column quite easily:
x[which(x$b2 == 7), "b"] <- NA_real_
My problem is that I want this applied across all my columns but I don't know how to convert this single column formula to work on multiple columns. I tried:
onez <- c(2:4)
twoz <- c(5:7)
f <- function(df, ones, twos) {
df[which(df[,twos] == 7),][ones] <- NA_real_
}
mapply(f, df = x, ones = onez, twos = twoz)
But I'm getting error messages (incorrect dimensions, etc.). I can see that my function is messy, but I lack the knowledge to fix it.
One way to do it is to tell it to:
Get the subset of the data frame with columns 5, 6, 7: x[5:7]
Check from that subset which values satisfy your condition: x[5:7] == 7
Replace those values with NA: ... <- NA
This gives the following,
x[5:7][x[5:7] == 7] <- NA
x
# a b c d b2 c2 d2
#1 1 2 3 4 5 6 NA
#2 2 3 4 5 6 NA 8
#3 3 4 5 6 NA 8 9
If you want the NAs to be replaced at x[2:4], then you can do,
x[2:4][x[5:7] == 7] <- NA
x
# a b c d b2 c2 d2
#1 1 2 3 NA 5 6 7
#2 2 3 NA 5 6 7 8
#3 3 NA 5 6 7 8 9
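Since the question title mentions mapply: a Map()-based sketch of the same idea (an alternative of mine, not part of the answer above), which walks over the column pairs explicitly:
# starting again from the original x
x[2:4] <- Map(function(target, flag) replace(target, flag == 7, NA),
              x[2:4], x[5:7])
x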

R - Replace values in a specific even column based on values from a specific odd column - Application to the whole dataframe

My data frame:
data <- data.frame(A = c(1,5,6,8,7), qA = c(1,2,2,3,1), B = c(2,5,6,8,4), qB = c(2,2,1,3,1))
For the pair A and qA (= quality A): I want the values of A whose quality value is 1 or 3 to be replaced by NA.
And the same for the pair B and qB.
The final data has to be like this:
desired_data <- data.frame(A = c("NA",5,6,"NA","NA"), qA = c(1,2,2,3,1), B = c(2,5,"NA","NA","NA"), qB = c(2,2,1,3,1))
My question is how to perform that?
I have a big dataframe with about 90 columns, so I need code that doesn't rely on the column names to work properly.
To help, I have this bit of code, which selects the columns starting with the letter "q":
data[,grep("^[q]", colnames(data))]
You could just do this...
data[, seq(1, ncol(data), 2)][(data[, seq(2, ncol(data), 2)] == 1) |
                              (data[, seq(2, ncol(data), 2)] == 3)] <- NA
data
A qA B qB
1 NA 1 2 2
2 5 2 5 2
3 6 2 NA 1
4 NA 3 NA 3
5 NA 1 NA 1
One solution is to separate the data into two tables and use vectorisation in base R:
data <- data.frame(A = c(1,5,6,8,7), qA = c(1,2,2,3,1), B = c(2,5,6,8,4), qB = c(2,2,1,3,1))
data
#> A qA B qB
#> 1 1 1 2 2
#> 2 5 2 5 2
#> 3 6 2 6 1
#> 4 8 3 8 3
#> 5 7 1 4 1
quality <- data[,grep("^[q]", colnames(data))]
data2 <- data[,setdiff(colnames(data), names(quality))]
data2[quality == 1 | quality == 3] <- NA
data2
#> A B
#> 1 NA 2
#> 2 5 5
#> 3 6 NA
#> 4 NA NA
#> 5 NA NA
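If you then want the quality columns back next to the cleaned values, one option (my addition, not part of the answer above) is to write data2 back into the original columns:
data[, names(data2)] <- data2
data   # same result as in the first answer above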

R Compare Columns across Dataframes to Match Values

I have two dataframes, looking at houses (n=6) and certain dates (n=22).
ORIGINAL is the original dataset. It contains 38 observations on 5 variables. Not all houses have all the dates listed, and vice versa, leading to errors in calculations with different length variables.
SAMPLE is a new empty dataset. It contains 132 (6 x 22) observations on the same 5 variables. Now there is an observation for every household for every date.
House Day Mongoose Fruit Elephant
A 1 40 7 0.6
A 6 32 12 4.2
B 2 50 3 4.0
B 4 51 4 8.6
B 6 8 7 12.1
C 2 12 8 13.0
I am trying to fill in the rest of SAMPLE by asking R to compare HouseID and Date between the two dataframes; if they match, the rest of the variables (mongoose, fruit, elephant) should be copied over for that observation.
I tried this to no avail...
for (i in 1:nrow(original))
{
  if ((sample$Day == original$Day) && (sample$House == original$House))
  {
    sample$Mongoose[i] <- original$Mongoose[i]
    sample$Fruit[i] <- original$Fruit[i]
    sample$Elephant[i] <- original$Elephant[i]
  }
}
The following happens. I get these 3 warnings in sequence:
In sample$Day == test$Day : longer object length is not a multiple of shorter object length
In is.na(e1) | is.na(e2) : longer object length is not a multiple of shorter object length
In ==.default(sample$House, test$House) : longer object length is not a multiple of shorter object length
The data DOES copy over, but incorrectly. All the values get transferred to the A house and sequential date, rather than the appropriate house and date.
I.e., it looks like this
House Day Mongoose Fruit Elephant
A 1 40 7 0.6
A 2 50 3 4.0
A 3 51 4 8.6
A 4 8 7 12.1
A 5 12 8 13.0
A 6 32 12 4.2
B 1
B 2
B 3 [...]
When it should (in essence) look like this:
House Day Mongoose Fruit Elephant
A 1 40 7 0.6
A 2
A 3
A 4
A 5
A 6 32 12 4.2 [rest of A houses have no data]
B 1
B 2 50 3 4.0
B 3
B 4 51 4 8.6
B 5
B 6 8 7 12.1 [rest of B houses have no data]
C 1
C 2 12 8 13.0
Please advise; I will eventually have to extend this technique to look at a sample dataset with 198K entries, and a test dataset with 115K.
Thanks!
Sounds to me like this should work:
merge(sample, original, by = c("House", "Day"), all.x = TRUE)
But hard to tell without a reproducible example. You may also want to look into dplyr::left_join(). That is, assuming your data looks like the following:
sample <- data.frame(House = rep(c("A", "B", "C"), each = 6),
                     Day = rep(1:6, 3))
# head(sample)
# House Day
# 1 A 1
# 2 A 2
# 3 A 3
# 4 A 4
# 5 A 5
# 6 A 6
original <- data.frame(House = c("A", "A", "B", "B", "C"),
                       Day = c(1, 6, 2, 4, 2),
                       Mongoose = c(40, 32, 50, 51, 8),
                       Fruit = c(7, 12, 3, 4, 8),
                       Elephant = c(0.6, 4.2, 4.0, 8.6, 12.1))
# head(original)
# House Day Mongoose Fruit Elephant
# 1 A 1 40 7 0.6
# 2 A 6 32 12 4.2
# 3 B 2 50 3 4.0
# 4 B 4 51 4 8.6
# 5 C 2 8 8 12.1
We obtain:
# head(merge(sample, original, by = c("House", "Day"), all.x = TRUE))
# House Day Mongoose Fruit Elephant
# 1 A 1 40 7 0.6
# 2 A 2 NA NA NA
# 3 A 3 NA NA NA
# 4 A 4 NA NA NA
# 5 A 5 NA NA NA
# 6 A 6 32 12 4.2
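The dplyr::left_join() route mentioned above gives an equivalent result (assuming dplyr is installed):
library(dplyr)
left_join(sample, original, by = c("House", "Day"))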
It could be a small tweak. Look at this line of your original code:
if ((sample$Day == original$Day) && (sample$House == original$House))
See if you can change it to this:
if ((sample$Day[i] == original$Day[i]) && (sample$House[i] == original$House[i]))
Because:
You are using a for loop with an i variable,
which you use correctly in lines such as sample$Mongoose[i] <- original$Mongoose[i],
but in your example the if statement is not actually making use of the i variable,
so we revise it to use i, so that it compares that specific row's sample$Day with that row's original$Day, and likewise sample$House with original$House.

Removing rows with NA in R [duplicate]

This question already has answers here:
Remove rows with all or some NAs (missing values) in data.frame
(18 answers)
Closed 5 years ago.
I have a dataframe with 2500 rows. A few of the rows have NAs (an excessive number of NAs), and I want to remove those rows.
I've searched the SO archives, and come up with this as the most likely solution:
df2 <- df[df[, 12] != NA,]
But when I run it and look at df2, all I see is a screen full of NAs (and <NA>s).
Any suggestions?
Depending on what you're looking for, one of the following should help you on your way:
Some sample data to start with:
mydf <- data.frame(A = c(1, 2, NA, 4), B = c(1, NA, 3, 4),
                   C = c(1, NA, 3, 4), D = c(NA, 2, 3, 4),
                   E = c(NA, 2, 3, 4))
mydf
# A B C D E
# 1 1 1 1 NA NA
# 2 2 NA NA 2 2
# 3 NA 3 3 3 3
# 4 4 4 4 4 4
If you wanted to remove rows just according to a few specific columns, you can use complete.cases or the solution suggested by @SimonO101 in the comments. Here, I'm removing rows which have an NA in the first column.
mydf[complete.cases(mydf$A), ]
# A B C D E
# 1 1 1 1 NA NA
# 2 2 NA NA 2 2
# 4 4 4 4 4 4
mydf[!is.na(mydf[, 1]), ]
# A B C D E
# 1 1 1 1 NA NA
# 2 2 NA NA 2 2
# 4 4 4 4 4 4
If, instead, you wanted to set a threshold, as in "keep only the rows that have fewer than 2 NA values" (but you don't care which columns the NA values are in), you can try something like this:
mydf[rowSums(is.na(mydf)) < 2, ]
# A B C D E
# 3 NA 3 3 3 3
# 4 4 4 4 4 4
On the other extreme, if you want to delete all rows that have any NA values, just use complete.cases:
mydf[complete.cases(mydf), ]
# A B C D E
# 4 4 4 4 4 4
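As a side note, na.omit() is a shorthand for the complete.cases() subsetting just shown:
na.omit(mydf)   # keeps only row 4, the row with no NA values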
