Fill = T won't work with single letters (?) [R]

I'm using 'fill = T' on a file that has single letters separated by commas:
Pred
1 T,T
2 NA
3 D
4 NA
5 NA
6 T
7 P,B
8 NA
9 NA
using the command:
sift <- read.table("/home/pred.txt", header=F, fill=TRUE, sep=',', stringsAsFactors=F)
I was hoping sift would turn out as:
V1 V2
1 T T
2 <NA>
3 D
4 <NA>
5 <NA>
6 T
7 P B
8 <NA>
9 <NA>
However, it comes out like:
V1
1 T
2 <NA>
3 D
4 <NA>
5 <NA>
6 T
7 P
8 <NA>
9 <NA>
This code works when there are multiple sampleIDs (separated by a comma) in each row - but not for single letters. Does 'fill' work for single letters? Stupid question, I know.

So here is a workaround:
url <- "https://dl.dropboxusercontent.com/s/bjb241s16t63ev8/pred.txt?dl=1&token_hash=AAEBzfCGgoeHgNTvhMSVoZK6qRGrdwwuDZB3h8lWTZNtkA"
df.1 <- read.table(url,header=F,sep=",",fill=T,stringsAsFactors=F)
dim(df.1)
# [1] 149792 1 <-- 149,792 rows and ** 1 ** column
df.2 <- read.table(url,header=F,sep=",",fill=T,stringsAsFactors=F,
col.names=c("V1","V2"))
dim(df.2)
# [1] 149633 2 <-- 149,633 rows and ** 2 ** columns
head(df.2[which(nchar(df.2$V2)>0),])
# V1 V2
# 1000 T T
# 2419 T T
# 3507 T T
# 3766 T D
# 4308 T D
# 4545 T D
read.table(...) determines the number of columns from the first five lines of input (or from the length of col.names, if that is given and is longer). Since the first five lines in your file have only one field each, one column is what you get. With fill=TRUE, the "extra" fields on later lines then appear to be wrapped onto extra rows, which is why df.1 has more rows than df.2.
The workaround explicitly sets the number of columns by specifying column names, which could be anything, as long as length(col.names) is 2.
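If the maximum number of fields is not known in advance, one possible sketch is to let count.fields() scan the whole file and size col.names from its result. The file contents below are made up for illustration (a temporary file standing in for pred.txt), and the V1, V2 names are just the defaults rebuilt by hand.

```r
# Hypothetical sample file: the first five lines have one field each,
# a later line has two.
tmp <- tempfile(fileext = ".txt")
writeLines(c("T", "NA", "D", "NA", "NA", "T", "P,B", "NA"), tmp)

# Without col.names, only the first five lines decide the column count.
one_col <- read.table(tmp, header = FALSE, sep = ",", fill = TRUE,
                      stringsAsFactors = FALSE)
ncol(one_col)

# count.fields() looks at every line, so its maximum is the widest line.
n_cols <- max(count.fields(tmp, sep = ","))

two_col <- read.table(tmp, header = FALSE, sep = ",", fill = TRUE,
                      stringsAsFactors = FALSE,
                      col.names = paste0("V", seq_len(n_cols)))
dim(two_col)
```

Scanning the file twice costs extra time on large inputs, so hard-coding col.names is still the simpler choice when the width is known.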

Related

Split column - but only if it contains one or more special characters

I found a few questions heading in this direction, but I could not apply the solutions to my specific problem: I have quite a messy column of addresses in a data frame. There can be empty cells, numbers, numbers combined with text, and one or more special characters in between.
In a first step, I want to split all values at the first special character. I tried various options that work partially. However, the problem seems to be that some cells don't contain any special characters - causing an error in the function.
For example, the following code puts only the special character in the new column b, but does not really split the columns:
df <- df %>%
separate(address, into = c("a", "b"), sep = "[^[:punct:]]+", remove = FALSE)
Ideally, what I want to achieve is the following: if there is a special character in the cell, split it at the first special character, with everything left of it in column a and everything right of it in column b. If there is no special character, put the whole thing in column a and NA in column b.
Do I have to wrap my code in an ifelse-statement? Or are there any other suggestions?
Thanks!
Edit: as requested, some sample data:
library(dplyr)
test <- as.data.frame(c("2", "97/7", "17/7-8", "7E", "800E/7", "17", "", "0", "2/15", "17+18", "17/7/8", "19", "2/2/4", "9-7/8")) %>%
rename(address = 1)
You can use separate using extra = 'merge' and fill = 'right'
tidyr::separate(test, address, into = c("a", "b"), '[[:punct:]]',
extra = 'merge', fill = 'right', remove = FALSE)
# address a b
#1 2 2 <NA>
#2 97/7 97 7
#3 17/7-8 17 7-8
#4 7E 7E <NA>
#5 800E/7 800E 7
#6 17 17 <NA>
#7 <NA>
#8 0 0 <NA>
#9 2/15 2 15
#10 17+18 17 18
#11 17/7/8 17 7/8
#12 19 19 <NA>
#13 2/2/4 2 2/4
#14 9-7/8 9 7/8
Using strsplit with your regular expression works. We may put it in an lapply to loop over the columns. Using `length<-()` we adjust the lengths of the list elements to their maximum to be able to create a data.frame.
r <- el(lapply(test, strsplit, "[[:punct:]]", perl=TRUE))
as.data.frame(t(sapply(r, `length<-`, max(lengths(r)))))
# V1 V2 V3
# 1 2 <NA> <NA>
# 2 97 7 <NA>
# 3 17 7 8
# 4 7E <NA> <NA>
# 5 800E 7 <NA>
# 6 17 <NA> <NA>
# 7 <NA> <NA> <NA>
# 8 0 <NA> <NA>
# 9 2 15 <NA>
# 10 17 18 <NA>
# 11 17 7 8
# 12 19 <NA> <NA>
# 13 2 2 4
# 14 9 7 8
Similarly we can do it at the first occurrence: We may use sub to replace the first occurrence with something, say "£" and then split it there.
test[] <- lapply(test, sub, pat="[[:punct:]]", rep="£")
r <- el(lapply(test, strsplit, "£"))
as.data.frame(t(sapply(r, `length<-`, max(lengths(r)))))
# V1 V2
# 1 2 <NA>
# 2 97 7
# 3 17 7-8
# 4 7E <NA>
# 5 800E 7
# 6 17 <NA>
# 7 <NA> <NA>
# 8 0 <NA>
# 9 2 15
# 10 17 18
# 11 17 7/8
# 12 19 <NA>
# 13 2 2/4
# 14 9 7/8
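A variant of the first-occurrence split that avoids a placeholder character such as "£" (which would break if the data ever contained it) is to locate the first punctuation character with regexpr() and cut the string with substr(). This is only a sketch on a shortened copy of the sample data; the names pos, a and b are chosen here for illustration.

```r
address <- c("2", "97/7", "17/7-8", "7E", "", "9-7/8")

# Position of the first punctuation character, or -1 if there is none.
# as.integer() drops the extra attributes that regexpr() attaches.
pos <- as.integer(regexpr("[[:punct:]]", address))

# Left part: everything before the match, or the whole string if no match.
a <- ifelse(pos > 0, substr(address, 1, pos - 1), address)
# Right part: everything after the match, or NA if no match.
b <- ifelse(pos > 0, substr(address, pos + 1, nchar(address)), NA_character_)

data.frame(address, a, b, stringsAsFactors = FALSE)
```

Cells without a special character, including empty ones, end up unchanged in a with NA in b, which matches the behaviour asked for in the question.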
Does this work:
library(dplyr)
library(tidyr)
df %>% separate(c1, c('a','b'), sep = '[^A-Za-z0-9_]')
a b
1 ab cd
2 pq rj
3 xy z
4 abcd <NA>
Data used:
df
c1
1 ab$cd
2 pq%rj
3 xy#z
4 abcd

Appending csvs with different column quantities and spellings

Nothing too complicated, it would just be useful to use rbindlist on a large number of csvs where the column names change a little over time (minor spelling changes), the column orders remain the same, and at some point, two additional columns are added to the csvs (which I don't really need).
library(data.table)
csv1 <- data.table("apple" = 1:3, "orange" = 2:4, "dragonfruit" = 13:15)
csv2 <- data.table("appole" = 7:9, "orangina" = 6:8, "dragonificfruit" = 2:4, "pear" = 1:3)
l <- list(csv1, csv2)
When I run
csv_append <- rbindlist(l, fill=TRUE) #which also forces use.names=TRUE
it gives me a data.table with 7 columns
apple orange dragonfruit appole orangina dragonificfruit pear
1: 1 2 13 NA NA NA NA
2: 2 3 14 NA NA NA NA
3: 3 4 15 NA NA NA NA
4: NA NA NA 7 6 2 1
5: NA NA NA 8 7 3 2
6: NA NA NA 9 8 4 3
as opposed to what I want, which is:
V1 V2 V3 V4
1: 1 2 13 NA
2: 2 3 14 NA
3: 3 4 15 NA
4: 7 6 2 1
5: 8 7 3 2
6: 9 8 4 3
which I can use, even though I have to go through the extra step later of renaming the columns back to standard variable names.
If I instead try the default fill=FALSE and use.names=FALSE, it throws an error:
Error in rbindlist(l) :
Item 2 has 4 columns, inconsistent with item 1 which has 3 columns. To fill missing columns use fill=TRUE.
Is there a simple way to manage this, either by forcing fill=TRUE and use.names=FALSE somehow or by omitting the additional columns in the csvs that have them by specifying a vector of columns to append?
If we only need first 3 columns, then drop the rest and bind as usual:
rbindlist(lapply(l, function(i) i[, 1:3]))
# apple orange dragonfruit
# 1: 1 2 13
# 2: 2 3 14
# 3: 3 4 15
# 4: 7 6 2
# 5: 8 7 3
# 6: 9 8 4
Another option, from the comments: we could directly read the files, and set to keep only first 3 columns using fread, then bind:
rbindlist(lapply(filenames, fread, select = c(1:3)))
Here is an option with name matching using phonetic from stringdist. Extract the column names from the list of data.table ('nmlist'), unlist, group using phonetic, get the first element, relist it to the same list structure as 'nmlist', use Map to change the column names of the list of data.table, and then apply rbindlist
library(stringdist)
library(data.table)
nmlist <- lapply(l, names)
nm1 <- unlist(nmlist)
rbindlist(Map(setnames, l, relist(ave(nm1, phonetic(nm1),
FUN = function(x) x[1]), skeleton = nmlist)), fill = TRUE)
-output
# apple orange dragonfruit pear
#1: 1 2 13 NA
#2: 2 3 14 NA
#3: 3 4 15 NA
#4: 7 6 2 1
#5: 8 7 3 2
#6: 9 8 4 3
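Since the question states that the column order stays the same, another possible sketch is to rename the leading columns by position before binding, which sidesteps the spelling differences altogether. The vector std_names is an assumption made here (the "standard" names are not given in the question), and copy() is used so that the original tables are not renamed by reference.

```r
library(data.table)

csv1 <- data.table(apple = 1:3, orange = 2:4, dragonfruit = 13:15)
csv2 <- data.table(appole = 7:9, orangina = 6:8, dragonificfruit = 2:4, pear = 1:3)
l <- list(csv1, csv2)

# Assumed standard names for the columns every file shares, in file order.
std_names <- c("apple", "orange", "dragonfruit")

renamed <- lapply(l, function(dt) {
  dt <- copy(dt)                                  # leave the input untouched
  setnames(dt, seq_along(std_names), std_names)   # rename by position
  dt
})

# Columns now match by name; later additions such as "pear" are NA-filled.
out <- rbindlist(renamed, fill = TRUE)
out
```

Unlike phonetic matching, this depends only on column position, so it would silently mislabel data if a file ever reordered its columns.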

How can I insert blank rows every 3 existing rows in a data frame?

After a web scraping process I get a dataframe with the information I need, however the final excel format requires that I add a blank row every 3 rows. I have searched the web for help but have not found a solution yet.
With hypothetical data, the structure of my data frame is as follows:
mi_df <- data.frame(
"ID" = rep(1:3,c(3,3,3)),
"X" = as.character(c("a", "a", "a", "b", "b", "b", "c", "c", "c")),
"Y" = seq(1,18, by=2)
)
mi_df
ID X Y
1 1 a 1
2 1 a 3
3 1 a 5
4 2 b 7
5 2 b 9
6 2 b 11
7 3 c 13
8 3 c 15
9 3 c 17
The result I hope for is something like this
ID X Y
1 1 a 1
2 1 a 3
3 1 a 5
4
5 2 b 7
6 2 b 9
7 2 b 11
8
9 3 c 13
10 3 c 15
11 3 c 17
If the indices of a data frame contain NA, then the output will have NA rows. So my goal is to create a vector like 1 2 3 NA 4 5 6 NA ... and set it as the indices of mi_df.
cut <- rep(1:(nrow(mi_df)/3), each = 3)
mi_df[sapply(split(1:nrow(mi_df), cut), c, NA), ]
# ID X Y
# 1 1 a 1
# 2 1 a 3
# 3 1 a 5
# NA NA <NA> NA
# 4 2 b 7
# 5 2 b 9
# 6 2 b 11
# NA.1 NA <NA> NA
# 7 3 c 13
# 8 3 c 15
# 9 3 c 17
# NA.2 NA <NA> NA
If nrow(mi_df) is not a multiple of 3, then the following is a general solution:
# Version 1
cut <- rep(1:ceiling(nrow(mi_df)/3), each = 3, len = nrow(mi_df))
mi_df[Reduce(c, lapply(split(1:nrow(mi_df), cut), c, NA)), ]
# Version 2
cut <- rep(1:ceiling(nrow(mi_df)/3), each = 3, len = nrow(mi_df))
mi_df[Reduce(function(x, y) c(x, NA, y), split(1:nrow(mi_df), cut)), ]
Don't mind the NA in the output: some functions that write data to an Excel file have an optional argument that controls whether NA values are written as strings or left empty. E.g.
library(openxlsx)
write.xlsx(df, "test.xlsx", keepNA = FALSE) # defaults to FALSE
tmp <- split(mi_df, rep(1:(nrow(mi_df) / 3), each = 3))
# or split(mi_df, ggplot2::cut_width(seq_len(nrow(mi_df)), 3, center = 2))
do.call(rbind, lapply(tmp, function(x) { x[4, ] <- NA; x }))
ID X Y
1.1 1 a 1
1.2 1 a 3
1.3 1 a 5
1.4 NA <NA> NA
2.4 2 b 7
2.5 2 b 9
2.6 2 b 11
2.4.1 NA <NA> NA
3.7 3 c 13
3.8 3 c 15
3.9 3 c 17
3.4 NA <NA> NA
You can make empty rows like you show by assigning an empty character vector ("") instead of NA, but this will convert your columns to character, and I wouldn't recommend it.
My recommendation is somewhat different from all the other answers: don't make a mess of your dataset inside R. Use the existing packages to write to designated rows in an Excel workbook. For example, with the package XLConnect, the method writeWorksheet (called from writeWorksheetToFile) includes these arguments:
object The workbook to write to
data Data to write
sheet The name or index of the sheet to write to
startRow Index of the first row to write to. The default is startRow = 1.
startCol Index of the first column to write to. The default is startCol = 1.
So if you simply set up a loop that writes 3 rows of your data file at a time, then moves the row index down by 4 and writes the next 3 rows, etc., you're all set.
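The loop described above could look roughly like the sketch below. Note that it swaps in openxlsx (already used earlier in this thread) instead of XLConnect, because writeData() takes the same kind of startRow argument; the sheet name, the temporary output file and the chunking helper are assumptions made for illustration.

```r
library(openxlsx)

mi_df <- data.frame(ID = rep(1:3, each = 3),
                    X  = rep(c("a", "b", "c"), each = 3),
                    Y  = seq(1, 18, by = 2))

chunk_size <- 3
# Split the rows into consecutive groups of chunk_size.
chunks <- split(mi_df, ceiling(seq_len(nrow(mi_df)) / chunk_size))

# The first chunk starts at row 1 and carries the header; each later chunk
# starts chunk_size + 1 rows further down, leaving one blank row in between.
start_rows <- c(1, 2 + (seq_along(chunks)[-1] - 1) * (chunk_size + 1))

wb <- createWorkbook()
addWorksheet(wb, "Sheet1")
for (i in seq_along(chunks)) {
  writeData(wb, "Sheet1", chunks[[i]],
            startRow = start_rows[i], colNames = (i == 1))
}

out_file <- tempfile(fileext = ".xlsx")   # hypothetical output location
saveWorkbook(wb, out_file, overwrite = TRUE)
```

The data frame itself is never altered, so the columns keep their types and the blank rows exist only in the spreadsheet.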
Here's one method.
Splits into list by ID, adds empty row, then binds list back into data frame.
mi_df2 <- do.call(rbind,Map(rbind,split(mi_df,mi_df$ID),rep("",3)))
rownames(mi_df2) <- NULL

Conditionally update Dataframe from second dataframe in R

I have 2 dataframes and would like to use the second to update the first. The problem is that the second dataframe contains all the same entries, but some of its values are missing (as shown below):
  DF1         DF2              DF3
X Y         X Y              X Y
1 A         1 B              1 B
2 <NA>      2 B              2 B
3 <NA>      3 C       -->    3 C
4 D         4 <NA>           4 D
5 E         5 <NA>           5 E
It should be a simple update: an entry in DF1 is overwritten whenever the corresponding entry in DF2 is not NA.
I first thought of removing the NA rows from DF2:
DF2sub <- subset(DF2, !is.na(Y))
DF3 <- transform(DF1, Y = DF2sub$Y[match(X,DF2sub$X)])
but this code produces the following instead:
DF3
X Y
1 B
2 B
3 C
4 <NA>
5 <NA>
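You can use which() to get the indices of the non-NA and NA values of DF2$Y, take the non-NA rows from DF2 and the remaining rows from DF1, and bind them together. This assumes both data frames list the same X values in the same order.
DF3 <- rbind(DF2[which(!is.na(DF2$Y)),],DF1[which(is.na(DF2$Y)),])
Hope this solves your issue.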
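If the two data frames might not be in the same row order, a possible sketch is to look up each DF1 row in DF2 with match() and fall back to the old value where the lookup is NA. The sample data frames are rebuilt here from the table in the question; new_y is a name introduced for illustration.

```r
DF1 <- data.frame(X = 1:5, Y = c("A", NA, NA, "D", "E"),
                  stringsAsFactors = FALSE)
DF2 <- data.frame(X = 1:5, Y = c("B", "B", "C", NA, NA),
                  stringsAsFactors = FALSE)

# Value of DF2$Y for each row of DF1, matched on X (NA if absent or missing).
new_y <- DF2$Y[match(DF1$X, DF2$X)]

# Keep the old value wherever DF2 has nothing to offer.
DF3 <- DF1
DF3$Y <- ifelse(is.na(new_y), DF1$Y, new_y)
DF3
```

This keeps DF1's row order and also tolerates X values that are missing from DF2 entirely.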

Need help in data manipulation in R [duplicate]

I have a dataframe with 2 columns, id and cat_list:
id cat_list
1 A
2 A|B
3 E|F|G
4 I
5 P|R|T|Z
I want to achieve the below using R code:
id cat_list1 cat_list2 cat_list3 cat_list4
1 A
2 A B
3 E F G
4 I
5 P R T Z
tidyr::separate is handy:
library(tidyr)
df %>% separate(cat_list, into = paste0('cat_list', 1:4), fill = 'right')
## id cat_list1 cat_list2 cat_list3 cat_list4
## 1 1 A <NA> <NA> <NA>
## 2 2 A B <NA> <NA>
## 3 3 E F G <NA>
## 4 4 I <NA> <NA> <NA>
## 5 5 P R T Z
We can use cSplit. Here, we don't need to worry about the number of splits, as it will detect it automatically.
library(splitstackshape)
cSplit(df1, "cat_list", "|")
# id cat_list_1 cat_list_2 cat_list_3 cat_list_4
#1: 1 A NA NA NA
#2: 2 A B NA NA
#3: 3 E F G NA
#4: 4 I NA NA NA
#5: 5 P R T Z
NOTE: It may be better to fill with NA rather than ''.
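For completeness, the same split can be sketched in base R. The point worth noting is that "|" is a regex metacharacter, so strsplit() needs fixed = TRUE (or an escaped pattern) to treat it literally. The data frame below is rebuilt from the question, and the helper names parts, n and mat are made up for this example.

```r
df <- data.frame(id = 1:5,
                 cat_list = c("A", "A|B", "E|F|G", "I", "P|R|T|Z"),
                 stringsAsFactors = FALSE)

# fixed = TRUE makes "|" a literal separator rather than regex alternation.
parts <- strsplit(df$cat_list, "|", fixed = TRUE)
n <- max(lengths(parts))

# Pad every piece vector with NA up to the longest one, then stack as rows.
mat <- t(vapply(parts,
                function(p) c(p, rep(NA_character_, n - length(p))),
                character(n)))
colnames(mat) <- paste0("cat_list", seq_len(n))

out <- cbind(df["id"], as.data.frame(mat, stringsAsFactors = FALSE))
out
```

Like cSplit, this detects the number of output columns from the data and fills the gaps with NA rather than ''.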
