I have a question about generating a warning for more than one item at a time in R. Please refer to the following data frame and code:
Dataframe dat:
inputs var1 var2
A 1 a 1
B 2 b 3
B 3 b NA
C 4 d NA
C 5 e 4
if (any(duplicated(dat$inputs))==T){
warning(paste("The following inputs: ", dat$inputs[duplicated(dat$inputs)],"is duplicated.",sep=""))
}
As you can see both B and C will be shown in the warning, like:
Warning message:
The following inputs: B is duplicated.The following inputs: C is duplicated.
I'm okay with such warning message output, but it is not ideal. Is there a way to combine the two sentences and make it look like:
Warning message:
The following inputs: B,C are duplicated.
Thanks a lot in advance for your attention and time.
Helene
I couldn't get your code to run, so I modified your code and made up some data.
dat = read.table(text = "
inputs var1 var2 var3
A 1 a 1
B 2 b 3
B 3 b NA
C 4 d NA
C 5 e 4", header = T)
dup <- unique(dat$inputs[duplicated(dat$inputs)])
if (length(dup) > 1) {
  warning(paste0("The following inputs: ", paste0(dup, collapse = ", "), " are duplicated."))
} else if (length(dup) == 1) {
  warning(paste0("The following input: ", dup, " is duplicated."))
}
Warning message:
The following inputs: B, C are duplicated.
Single duplicate
dat = read.table(text = "
inputs var1 var2 var3
A 1 a 1
A 2 b 3
E 3 b NA
C 4 d NA
G 5 e 4", header = T)
Warning message:
The following input: A is duplicated.
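The same idea can be wrapped in a small function so the singular/plural choice lives in one place. This is only a sketch: `warn_duplicates` is a hypothetical name, and it relies on base R's `ngettext()` to pick between the two message templates.

```r
# Hypothetical helper (not from the answer above): warn once,
# naming every duplicated value in a single message.
warn_duplicates <- function(x) {
  dup <- unique(x[duplicated(x)])
  if (length(dup) > 0) {
    # ngettext() returns the first template for n == 1, the second otherwise
    template <- ngettext(length(dup),
                         "The following input: %s is duplicated.",
                         "The following inputs: %s are duplicated.")
    warning(sprintf(template, paste(dup, collapse = ", ")), call. = FALSE)
  }
  invisible(dup)
}

# Capture the warning text instead of printing it
msg <- tryCatch(warn_duplicates(c("A", "B", "B", "C", "C")),
                warning = function(w) conditionMessage(w))
```

Calling `warn_duplicates(dat$inputs)` on either data set above should produce the matching singular or plural message.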
I have a script written in R that is run weekly and produces a csv. I need to add headers on top of some of the column names, as the columns are grouped together.
Header1 Header2
A B C D E F
1 2 3 4 5 6
7 8 9 a b c
In this example ABC columns are under the "Header1" header, and DEF are under the "Header2" header. Obviously this can be done manually but I was curious if there was a package that can do this. "No" is an acceptable answer.
EDIT: I should have added that the file can also be an xlsx. Initially I write most of my files as CSVs since they usually get used by a script again at some point.
It is a bit ugly, but you can do this with a csv as long as you do not need merged cells. I used data.table in my example, but I am pretty sure any other writing function works, as long as you write the header row with append = FALSE and col.names = FALSE, and then the data with both set to TRUE. Reading the file back also gets a bit ugly, but you can skip the first row.
dt <- fread("A B C D E F
1 2 3 4 5 6
7 8 9 a b c")
fwrite(data.table(t(c("Header1", NA, NA, "Header2", NA, NA))), "test.csv", append = FALSE, col.names = FALSE)
fwrite(dt, "test.csv", append = TRUE, col.names = TRUE)
fread("test.csv")
# V1 V2 V3 V4 V5 V6
# 1: Header1 Header2
# 2: A B C D E F
# 3: 1 2 3 4 5 6
# 4: 7 8 9 a b c
fread("test.csv", skip = 1L)
# A B C D E F
# 1: 1 2 3 4 5 6
# 2: 7 8 9 a b c
If you happen to want your header information back, you can do something like this: read the first line, find the positions of the headers, and extract the headers themselves.
headers <- strsplit(readLines("test.csv", n = 1L), ",")[[1]]
which(headers != "")
# [1] 1 4
headers[which(headers != "")]
# [1] "Header1" "Header2"
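For readers without data.table, roughly the same two-step write can be sketched in base R. This is an illustration under assumptions, not the answer's method: it writes the group-header line and the data through one open file connection (avoiding `append`), and the header line `"Header1,,,Header2,,"` is hard-coded for this six-column example.

```r
# Example data matching the layout in the question
df <- data.frame(A = c("1", "7"), B = c("2", "8"), C = c("3", "9"),
                 D = c("4", "a"), E = c("5", "b"), F = c("6", "c"),
                 stringsAsFactors = FALSE)

path <- tempfile(fileext = ".csv")
con <- file(path, open = "w")
writeLines("Header1,,,Header2,,", con)   # group headers, blanks for the other columns
write.csv(df, con, row.names = FALSE)    # column names plus data go right below
close(con)

top  <- readLines(path, n = 1L)          # the group-header line
back <- read.csv(path, skip = 1L)        # skip it to get the real table back
```

Keeping the connection open for both writes means the column names are written only once, directly under the group headers.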
I'm writing an R function with aggregations using the data.table package. My table looks like:
Name1 Name2 Price
A F 6
A D 5
A E 2
B F 4
B D 7
C F 4
C E 2
My function looks like:
MyFun <- function(Master_Table, Desired_Column, Group_By){
Master_Table <- as.data.table(Master_Table)
Master_Table_New <- Master_Table[, (Master_Table$Desired_Column), by=.(Desired_Column$Group_By)]
return(Master_Table_New)
}
I want to calculate df[, .(Group_Median = median(Price)), by = .(Name1, Name2)]
But when I wrap it in my own function, it keeps giving me errors like:
Error in `[.data.table`(Master_Table, , .(Med_Group = mean(Master_Table$Desired_Column)), :
column or expression 1 of 'by' or 'keyby' is type NULL. Do not quote column names. Usage: DT[,sum(colC),by=list(colA,month(colB))]
This would be the very first step of my whole work. If anyone knows anything about this, please let me know, any help would be appreciated!
The function should be written as:
MyFun <- function(Master_Table, Desired_Column, Group_By){
Master_Table[, sapply(.SD, mean), .SDcols = Desired_Column, by=Group_By]
}
# Note how Group_By is passed: multiple grouping columns go in as one comma-separated string.
MyFun(DT, "Price", "Name1,Name2")
# Name1 Name2 V1
# 1: A F 6
# 2: A D 5
# 3: A E 2
# 4: B F 4
# 5: B D 7
# 6: C F 4
# 7: C E 2
Data
DT <- read.table(text =
"Name1 Name2 Price
A F 6
A D 5
A E 2
B F 4
B D 7
C F 4
C E 2",
header = TRUE, stringsAsFactors = FALSE)
setDT(DT)
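The question asks for a median while the answer above uses mean, and it returns the result in a column named V1. A variant that keeps the column name and takes the grouping columns as a character vector might look like the sketch below. `group_median` is a hypothetical name, and Price is stored as a double here so every group returns the same type.

```r
library(data.table)

# Hypothetical helper: median of one or more value columns per group.
# value_cols and group_cols are character vectors of column names.
group_median <- function(dt, value_cols, group_cols) {
  dt[, lapply(.SD, median), by = group_cols, .SDcols = value_cols]
}

DT <- data.table(
  Name1 = c("A", "A", "A", "B", "B", "C", "C"),
  Name2 = c("F", "D", "E", "F", "D", "F", "E"),
  Price = c(6, 5, 2, 4, 7, 4, 2)   # doubles, so medians have a consistent type
)

by_name1 <- group_median(DT, "Price", "Name1")
by_both  <- group_median(DT, "Price", c("Name1", "Name2"))
```

Using `lapply(.SD, ...)` rather than `sapply` keeps one named result column per entry in `value_cols`.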
I have a space delimited file and some columns are blank, so we end up having multiple spaces, and fread fails with error. But read.table works fine. See example:
library(data.table)
# R version 3.4.2 (2017-09-28)
# data.table_1.10.4-3
fread("A B C D
1 2  3
4 5 6 7", sep = " ", header = TRUE)
Error in fread("A B C D\n1 2 3\n4 5 6 7") :
Expected sep (' ') but new line, EOF (or other non printing character) ends field 2 when detecting types from point 0: 1 2 3
read.table(text ="A B C D
1 2  3
4 5 6 7", sep = " ", header = TRUE)
# A B C D
# 1 1 2 NA 3
# 2 4 5 6 7
How can we read this using fread? I tried setting sep = " " and na.strings = "", but it didn't help.
By default, fread sets strip.white = TRUE, meaning leading and trailing spaces are removed. That is useful for reading fixed-width files or files with an irregular number of spaces as the separator.
In read.table, by contrast, strip.white defaults to FALSE.
fread("A B C D
1 2  3
4 5 6 7", sep = " ", header = TRUE, strip.white = FALSE)
# A B C D
# 1: 1 2 NA 3
# 2: 4 5 6 7
Note: I'm providing a self-answer because I couldn't find a relevant post, and this has tripped me up more than once.
Edit: This doesn't work anymore for data.table_1.12.2, related GitHub Issue.
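To see why two consecutive spaces should mean an empty field, the parsing can be sketched by hand in base R. This is not an fread fix, just an illustration that assumes a single space is the separator, every line has the same number of fields, and the last column is never blank (`strsplit` drops a trailing empty string).

```r
lines <- c("A B C D",
           "1 2  3",     # two spaces after the 2: column C is blank
           "4 5 6 7")

# Split on every single space; consecutive spaces yield an empty string
fields <- strsplit(lines, " ", fixed = TRUE)

body <- do.call(rbind, fields[-1])             # character matrix, one row per data line
df <- as.data.frame(body, stringsAsFactors = FALSE)
names(df) <- fields[[1]]

# Turn empty strings into NA, then convert each column to integer
df[] <- lapply(df, function(col) {
  col[col == ""] <- NA
  as.integer(col)
})
```

For real files, the read.table call shown in the question remains the simpler fallback.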
I'm following the instructions in "Dummy variables from a string variable" to try to convert a column of strings (words separated by spaces) into dummy variables (0/1 to indicate whether a word is not used/used in the string in that row) using concat.split.expanded, but I get a bunch of the following warning:
In lapply(listOfValues, as.integer) : NAs introduced by coercion
preceded by one of
Error in seq.default(min(vec), max(vec)) : 'from' cannot be NA, NaN or infinite
I'm pretty sure there aren't any NAs in the column to be converted, let alone that many. Not sure how to go about fixing this. Thanks!
command I've been running that produces the problem:
concat.split.expanded(dataset, "stringvarname", sep = " ", mode = "binary", drop = FALSE)
Produces the problem with or without fill=
You need to specify that you are splitting concatenated strings ("var2" in the sample data below) and not numeric values concatenated as strings ("var3" in the sample data below).
Here's an example that reproduces your error and shows the working solution:
df = data.frame(var1 = 1:2, var2 = c("a b c", "a c d"), var3 = c("1 2 3", "1 2 5"))
library(splitstackshape)
cSplit_e(df, "var3", sep = " ")
# var1 var2 var3 var3_1 var3_2 var3_3 var3_4 var3_5
# 1 1 a b c 1 2 3 1 1 1 NA NA
# 2 2 a c d 1 2 5 1 1 NA NA 1
## Will give you an error
cSplit_e(df, "var2", sep = " ")
# Error in seq.default(min(vec), max(vec)) :
# 'from' cannot be NA, NaN or infinite
# In addition: Warning messages:
# 1: In lapply(listOfValues, as.integer) : NAs introduced by coercion
# 2: In lapply(listOfValues, as.integer) : NAs introduced by coercion
cSplit_e(df, "var2", sep = " ", type = "character")
# var1 var2 var3 var2_a var2_b var2_c var2_d
# 1 1 a b c 1 2 3 1 1 1 NA
# 2 2 a c d 1 2 5 1 NA 1 1
Why? cSplit_e uses seq, and seq is for numeric input.
> seq("a", "c")
Error in seq.default("a", "c") : 'from' cannot be NA, NaN or infinite
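What the character mode has to do instead can be sketched in base R: collect the distinct words, then test each row for membership. This is an illustration only, not how cSplit_e is implemented. The `var2_` prefix just mimics the output above, and absent words are marked 0 here rather than NA.

```r
x <- c("a b c", "a c d")

tokens <- strsplit(x, " ", fixed = TRUE)   # words per row
vocab  <- sort(unique(unlist(tokens)))     # all distinct words: a, b, c, d

# One row per input string, one 0/1 column per word
dummies <- t(vapply(tokens,
                    function(tk) as.integer(vocab %in% tk),
                    integer(length(vocab))))
colnames(dummies) <- paste0("var2_", vocab)
```

No call to seq() is needed, because the columns come from the words themselves rather than from a numeric range.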
For a dummy dataset:
require(data.table)
require(reshape2)
teamid <- c(1,2,3)
member <- c("a,b","","c,g,h")
leader <- c("c", "d,e", "")
dt <- data.table(teamid, member, leader)
Now the dataset looks like this:
   teamid member leader
1:      1    a,b      c
2:      2           d,e
3:      3  c,g,h
There are 3 columns. Each team has its members and its leaders in separate columns. A team may have only members and no leaders, and vice versa.
The following is my ALMOST desired output:
teamid value leader
1: 1 a FALSE
2: 1 b FALSE
3: 1 c TRUE
4: 1 c TRUE
5: 2 d TRUE
6: 2 e TRUE
7: 3 c FALSE
8: 3 g FALSE
9: 3 h FALSE
I want to have the two columns merged into one, and add a tag if one is a team leader.
I have an ugly solution for this,
dt1 <- dt[, strsplit(member, ","), by = teamid]
dt2 <- dt[, strsplit(leader, ","), by = teamid]
setkey(dt1,teamid)
setkey(dt2,teamid)
dt3 <- merge(dt1,dt2, all = TRUE)
dt4 <- melt(dt3, id = 1, measure = c("V1.x", "V1.y"))
dt5 <- dt4[value!="NA_real"]
dt6 <- dt5[, leader := (variable == "V1.y")][, variable := NULL]
setkey(dt6, teamid)
setnames(dt6,value,member)
Issues:
This solution is not efficient, I think: it merges first and then melts. Are there other ways to do this?
There are duplicated rows (rows 3 and 4).
When I tried to change the column name, an error came up:
setnames(dt6,value,member)
Error in setnames(dt6, value, member) : object 'value' not found
Maybe the most important thing:
When I tried this on my real dataset, which has more than 1 million rows and 3 columns, the following error occurred:
merge(df1,df2, all = TRUE)
Error in vecseq(f__, len__, if (allow.cartesian) NULL else as.integer(max(nrow(x), :
Join results in 238797 rows; more than 142095 = max(nrow(x),nrow(i)). Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including j and dropping by (by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.
Any suggestion? Thanks a lot!
Melt first.
result <- melt(dt,id="teamid", variable.name="status", value.name="member")
result <- result[nchar(member)>0,strsplit(member,","),by=list(teamid,status)]
setnames(result,"V1","member")
setkey(result,teamid,status)
result
# teamid status member
# 1: 1 member a
# 2: 1 member b
# 3: 1 leader c
# 4: 2 leader d
# 5: 2 leader e
# 6: 3 member c
# 7: 3 member g
# 8: 3 member h
If you want to get rid of the status column and add a "tag" to the member column, you can do it this way:
result[status=="leader",member:=paste0(member,"*")]
result[,status:=NULL]
result
# teamid member
# 1: 1 a
# 2: 1 b
# 3: 1 c*
# 4: 2 d*
# 5: 2 e*
# 6: 3 c
# 7: 3 g
# 8: 3 h
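If a logical leader column is preferred over a "*" tag, as in the output the question sketches, the same split-then-stack idea can be written in base R. This is a sketch under assumptions: `split_role` is a hypothetical helper, and it expects at least one non-empty entry per column.

```r
teams <- data.frame(teamid = c(1, 2, 3),
                    member = c("a,b", "", "c,g,h"),
                    leader = c("c", "d,e", ""),
                    stringsAsFactors = FALSE)

# Hypothetical helper: one row per person, tagged with a leader flag.
split_role <- function(ids, packed, is_leader) {
  parts <- strsplit(packed, ",", fixed = TRUE)   # "" splits to zero people
  n <- lengths(parts)
  data.frame(teamid = rep(ids, n),
             member = unlist(parts),
             leader = rep(is_leader, sum(n)),
             stringsAsFactors = FALSE)
}

long <- rbind(split_role(teams$teamid, teams$member, FALSE),
              split_role(teams$teamid, teams$leader, TRUE))
long <- long[order(long$teamid), ]   # group rows by team, members before leaders
rownames(long) <- NULL
```

Because each column is split on its own and the pieces are stacked, no merge is involved and the duplicated rows from the question do not appear.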
A slightly simpler approach may be
crew <- dt[, .(strsplit(member, ","))]
crew <- unlist(crew)
leads <- dt[, .(strsplit(leader, ","))]
leads <- unlist(leads)
dt_long <- data.table(people = c(crew, leads),
                      status = rep(c("crew", "leader"), c(length(crew), length(leads))))
It gives me
people status
1: a crew
2: b crew
3: c crew
4: g crew
5: h crew
6: c leader
7: d leader
8: e leader
You can try a tidyverse solution now
library(tidyverse)
dt %>%
separate_rows(member) %>%
separate_rows(leader) %>%
gather(status, member, -teamid) %>%
distinct() %>%
filter(member != "") %>%
mutate(member=ifelse(status == "leader", paste0(member, "*"), member)) %>%
select(-status)
teamid member
1 1 a
2 1 b
3 3 c
4 3 g
5 3 h
6 1 c*
7 2 d*
8 2 e*