I have a problem with special characters in a column of a table.
Here is an example of the data:
structure(list(shipType = structure(c(1L, 3L, 1L, 2L, 4L), .Label = c("CARGO",
"FISHING", "TOWING_LONG_WIDE", "UNKNOWN"), class = "factor"),
shipCargo = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "UNDEFINED", class = "factor"),
destination = structure(c(3L, 1L, 2L, 4L, 5L), .Label = c("\\KORSOR ;.,NA,.\\",
"LEHTMA", "RIGA", "TALLIN", "VYBORG"), class = "factor"),
eta = structure(c(1L, 2L, 5L, 3L, 4L), .Label = c("01/01 00:00 UTC",
"01/01 09:00 UTC", "24/12 16:00 UTC", "26/12 07:00 UTC",
"30/12 16:00 UTC"), class = "factor"), imo = structure(c(3L,
5L, 1L, 4L, 2L), .Label = c("7101891", "7406318", "9066045",
"9158185", "Russia"), class = "factor"), callsign = structure(c(5L,
1L, 2L, 3L, 4L), .Label = c("12", "UALB", "UBYK8", "UFPC",
"UICC"), class = "factor"), country = structure(c(2L, 1L,
2L, 2L, 2L), .Label = c("2014-12-29", "Russia"), class = "factor"),
month = c(12L, 1L, 12L, 12L, 12L), date = structure(c(2L,
1L, 2L, 2L, 2L), .Label = c("", "2014-12-29"), class = "factor"),
week = c(1L, NA, 1L, 1L, 1L), X = c(NA, NA, NA, NA, NA)), .Names = c("shipType",
"shipCargo", "destination", "eta", "imo", "callsign", "country",
"month", "date", "week", "X"), class = "data.frame", row.names = c(NA,
-5L))
As you can see in the second row, the "destination" column is broken when I read the file with the following code:
data <- read.table(file, header=T, fill=T, sep=",")
I have tried different things, such as exporting with quotes and reading the file without headers:
data <- read.table(file, sep=",", fill=T, head=F, quote="")
and then removing the first line (the actual headers, which end up as a row of the table) and adding the headers back:
data <- data[-1,]
colnames(data)<-c( "shipType", "shipCargo","destination","eta","imo","callsign", "country","month","date","week")
It looks better, but there are a lot of special characters, and editing them by hand would be time-consuming and error-prone (I have a lot of tables).
Is there a way to keep the columns from being messed up when importing the file?
Thank you!
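One way to narrow this down, sketched under the assumption that unquoted commas inside the destination field are what shifts the columns: count.fields() reports how many fields each line has, so malformed rows stand out, and if the file can be re-exported with every field quoted, read.csv() keeps embedded commas inside one column. The sample lines below are made up for illustration and are not the real file.

```r
# Hypothetical sample lines; the destination in the last row contains commas.
raw_lines <- c(
  "shipType,destination,imo",
  "CARGO,RIGA,9066045",
  "TOWING_LONG_WIDE,KORSOR ;.,NA,.,9158185"
)

# Count the fields on each line, with quote handling switched off.
con <- textConnection(raw_lines)
n_fields <- count.fields(con, sep = ",", quote = "")
close(con)

# Lines whose field count differs from the header line.
bad_rows <- which(n_fields != n_fields[1])

# The same data with every field quoted parses into three columns.
quoted_lines <- c(
  'shipType,destination,imo',
  '"CARGO","RIGA","9066045"',
  '"TOWING_LONG_WIDE","KORSOR ;.,NA,.","9158185"'
)
ships <- read.csv(text = paste(quoted_lines, collapse = "\n"),
                  quote = "\"", comment.char = "", stringsAsFactors = FALSE)
```

Running the count.fields() step over each table first gives a quick list of the rows that need attention before any manual editing.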
Related
I have an Excel file like this:
which I tried to read by using:
library(xlsx)
df <- read.xlsx("2021.xlsx", sheetIndex = 1)
However, I obtained a result that I do not like very much
> dput(df)
structure(list(Twitter = structure(c(3L, 1L, 1L, 2L, 2L), .Label = c("Jack",
"John", "User"), class = "factor"), NA. = structure(c(5L, 1L,
3L, 4L, 2L), .Label = c("Hello world", "Hello!", "I'm a text",
"I'm an example", "Tweet"), class = "factor"), NA..1 = structure(c(3L,
1L, 1L, 2L, 2L), .Label = c("44293", "44294", "Date"), class = "factor"),
NA..2 = structure(c(3L, 1L, 1L, 2L, 2L), .Label = c("0.490277777777778",
"0.552083333333333", "Hour"), class = "factor"), NA..3 = structure(c(3L,
1L, 1L, 2L, 2L), .Label = c("3", "4", "x"), class = "factor"),
NA..4 = structure(c(3L, 2L, 2L, 1L, 1L), .Label = c("6",
"7", "y"), class = "factor"), NA..5 = structure(c(3L, 2L,
2L, 1L, 2L), .Label = c("no", "yes", "z"), class = "factor")), class = "data.frame", row.names =
c(NA, -5L))
i.e.,
> df
  Twitter            NA. NA..1             NA..2 NA..3 NA..4 NA..5
1    User          Tweet  Date              Hour     x     y     z
2    Jack    Hello world 44293 0.490277777777778     3     7   yes
3    Jack     I'm a text 44293 0.490277777777778     3     7   yes
4    John I'm an example 44294 0.552083333333333     4     6    no
5    John         Hello! 44294 0.552083333333333     4     6   yes
This is not the desired result. First, the date and the hour are wrong. Second, the column labels are strange (Twitter, NA., NA..1 and so on). The correct labels are instead in the first row of the data frame. I would like to obtain labels such as the following:
Twitter.User, Twitter.Tweet, Twitter.Date, Twitter.Hour, Twitter.x, Twitter.y, Twitter.z
Try read.xlsx("2021.xlsx", sheetIndex = 1, startRow = 2)
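If the date and hour columns still come through as raw numbers like 44293 and 0.490277777777778, they can be converted in base R. This is a minimal sketch that assumes the workbook uses Excel's 1900 date system (serial day 0 is 1899-12-30); the variable names are made up for the example.

```r
# Values copied from the output shown in the question.
serial_date <- c(44293, 44294)
day_fraction <- c(0.490277777777778, 0.552083333333333)

# Assumption: Excel 1900 date system, so the origin is 1899-12-30.
dates <- as.Date(serial_date, origin = "1899-12-30")

# A fraction of a day times 1440 gives minutes since midnight.
minutes <- as.integer(round(day_fraction * 1440))
times <- sprintf("%02d:%02d", minutes %/% 60L, minutes %% 60L)

# Prefix the real labels the way the question asks for.
labels <- paste0("Twitter.", c("User", "Tweet", "Date", "Hour", "x", "y", "z"))
```

The last line only builds the desired names; they could then be assigned with colnames(df) <- labels once the data frame has seven columns.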
I want to generate a heatmap from a data frame with 8 columns and 6 rows. The last row of the data frame holds the information used to annotate the columns. The structure of the data frame is as follows:
heatmap_try <-structure(list(BGC0000041 = structure(c(1L, 2L, 1L, 1L, 1L, 3L
), .Label = c("0", "0.447458977", "a"), class = "factor"), BGC0000128 = structure(c(1L,
1L, 1L, 3L, 2L, 4L), .Label = c("0", "1.785875195", "4.093659107",
"a"), class = "factor"), BGC0000287 = structure(c(1L, 1L, 1L,
3L, 2L, 4L), .Label = c("0", "1.785875195", "4.456229186", "b"
), class = "factor"), BGC0000294 = structure(c(3L, 1L, 2L, 4L,
1L, 5L), .Label = c("0", "2.035046947", "3.230553742", "3.286304185",
"b"), class = "factor"), BGC0000295 = structure(c(1L, 1L, 1L,
2L, 1L, 3L), .Label = c("0", "2.286304185", "c"), class = "factor"),
BGC0000308 = structure(c(4L, 2L, 3L, 5L, 1L, 6L), .Label = c("6.277728291",
"6.313707588", "6.607936616", "6.622871165", "6.64385619",
"c"), class = "factor"), BGC0000323 = structure(c(1L, 2L,
1L, 1L, 1L, 3L), .Label = c("0", "0.447458977", "c"), class = "factor"),
BGC0000328 = structure(c(1L, 2L, 1L, 1L, 1L, 3L), .Label = c("0",
"0.447458977", "c"), class = "factor")), class = "data.frame", row.names = c("Gut",
"Oral", "Anterior_nares", "Retroauricular_crease", "Vagina",
"AL"))
My code for generating the heatmap is as follows (I am using the pheatmap library):
library(pheatmap)
heatmap_data1 <- heatmap_try[ c(1:5), c(1:8) ]
anotation_data <- as.data.frame(t(heatmap_try[6, ]))
row.names(anotation_data) <- colnames(heatmap_data1)
pheatmap(heatmap_data1, annotation_col = anotation_data, color = colorRampPalette(c("white","blue"))(n=100),cellwidth = 40,cellheight = 6,fontsize_row = 5,cluster_rows = F,cluster_cols = F)
However, I am getting the following error:
Error in cut.default(x, breaks = breaks, include.lowest = T) :
'x' must be numeric
What am I doing wrong?
Thanks!
This is because the columns of heatmap_data1 are factors; they need to be numeric. One way to convert them is:
heatmap_data1_num <- as.data.frame(lapply(heatmap_data1,
function(x) as.numeric(as.character(x))))
# then as before
pheatmap(heatmap_data1_num, annotation_col = anotation_data, color = colorRampPalette(c("white","blue"))(n=100),cellwidth = 40,cellheight = 6,fontsize_row = 5,cluster_rows = F,cluster_cols = F)
:)
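Two details of the conversion are worth seeing on a small example: calling as.numeric() directly on a factor returns the internal level codes rather than the values, and rebuilding the data frame with as.data.frame(lapply(...)) drops the row names. The sketch below uses a made-up two-column stand-in for heatmap_try (the name toy is hypothetical) and assigns into vals[] so the row names survive.

```r
# Hypothetical stand-in with the same layout: numeric rows plus an annotation row.
toy <- data.frame(
  BGC0000041 = factor(c("0", "0.447458977", "a")),
  BGC0000128 = factor(c("0", "4.093659107", "a")),
  row.names = c("Gut", "Oral", "AL")
)
vals <- toy[1:2, ]

# Pitfall: this yields the level codes (1, 2), not the values (0, 0.447458977).
codes <- as.numeric(vals$BGC0000041)

# Going through as.character() recovers the values; vals[] keeps the row names.
vals[] <- lapply(vals, function(x) as.numeric(as.character(x)))
```

Keeping the row names matters here because pheatmap uses them as the row labels of the plot.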
Is there an easy way to collapse a data set into a reduced data frame based on certain characteristics? I was thinking of writing an algorithm for this, but is there a function in R that can do it? I've tried using dplyr, but it didn't work very well...
E.g.:
P.S.: My data is in a matrix of more than 1 GB, so I need a more automatic process.
Example Data:
structure(list(Nun = 1:6, Event = c(1L, 1L, 1L, 1L, 2L, 2L),
Time = structure(c(3L, 4L, 5L, 6L, 1L, 2L), .Label = c("11:34",
"11:36", "8:50", "8:52", "8:54", "8:56"), class = "factor"),
User = structure(c(1L, 1L, 1L, 1L, 2L, 2L), .Label = c("U1",
"U7"), class = "factor")), .Names = c("Nun", "Event", "Time",
"User"), class = "data.frame", row.names = c(NA, -6L))
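You can use summarise from the dplyr package (this assumes Time has already been converted to a date-time class, since max and min are not defined for factors):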
library(dplyr)
your_data_frame %>%
group_by(User, Event) %>%
summarise(Duration = max(Time) - min(Time))
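In the example data Time is a factor holding strings such as "8:50", so it has to be turned into something numeric before subtracting. A minimal base R sketch of the same grouping, assuming every time is written as H:MM and no event crosses midnight (the names events and durations are made up here):

```r
# Example data from the question, rebuilt with explicit factors.
events <- data.frame(
  Nun = 1:6,
  Event = c(1L, 1L, 1L, 1L, 2L, 2L),
  Time = factor(c("8:50", "8:52", "8:54", "8:56", "11:34", "11:36")),
  User = factor(c("U1", "U1", "U1", "U1", "U7", "U7"))
)

# Turn "H:MM" into minutes since midnight.
parts <- strsplit(as.character(events$Time), ":", fixed = TRUE)
events$Minutes <- vapply(
  parts,
  function(p) as.numeric(p[1]) * 60 + as.numeric(p[2]),
  numeric(1)
)

# One row per User/Event pair with the span between first and last time.
durations <- aggregate(Minutes ~ User + Event, data = events,
                       FUN = function(m) max(m) - min(m))
names(durations)[3] <- "Duration"
```

Here Duration is in minutes; the same Minutes column could also be fed into the dplyr pipeline above in place of Time.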
Here is the data.table way.
Example Data:
x<-structure(list(Nun = 1:6, Event = c(1L, 1L, 1L, 1L, 2L, 2L),
Time = structure(c(1508514600, 1508514720, 1508514840, 1508514960,
1508524440, 1508524560), class = c("POSIXct", "POSIXt"), tzone = ""),
User = structure(c(1L, 1L, 1L, 1L, 2L, 2L), .Label = c("U1",
"U7"), class = "factor")), .Names = c("Nun", "Event", "Time",
"User"), row.names = c(NA, -6L), class = "data.frame")
Code:
require(data.table)
setDT(x)
x[,list(Duration = max(Time)-min(Time)),by = list(Event,User)]
   Event User Duration
1:     1   U1   6 mins
2:     2   U7   2 mins
I have the following table
Code  Name     Class
1
2     Monday   day
5     green    color
9
6
1     red      color
1
2
9     Tuesday  day
6
5
The goal is to fill the Name and Class columns based on the Code column of a filled row. For example, the second row is filled and its code is 2. I would like to fill all the rows where Code = 2 with Name = Monday and Class = day.
I tried using fill() from tidyr, but that seems to require ordered data.
structure(list(Code = c(1L, 2L, 5L, 9L, 6L, 1L, 1L, 2L, 9L, 6L,
5L), Name = structure(c(1L, 3L, 2L, 1L, 1L, 4L, 1L, 1L, 5L, 1L,
1L), .Label = c("", "green", "Monday", "red", "Tuesday"), class = "factor"),
Class = structure(c(1L, 3L, 2L, 1L, 1L, 2L, 1L, 1L, 3L, 1L,
1L), .Label = c("", "color", "day"), class = "factor")), .Names = c("Code",
"Name", "Class"), class = "data.frame", row.names = c(NA, -11L
))
library(dplyr)
final_df <- left_join(df, df[df$Name!='',], by='Code')[,c(1,4:5)]
colnames(final_df) <- colnames(df)
final_df
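The same lookup can be sketched in base R with match(), assuming each Code has at most one filled row (the names lookup, idx and filled are made up for this example). Codes that never appear with a Name, such as 6, come out as NA.

```r
# Example data from the question, with plain character columns.
df <- data.frame(
  Code = c(1L, 2L, 5L, 9L, 6L, 1L, 1L, 2L, 9L, 6L, 5L),
  Name = c("", "Monday", "green", "", "", "red", "", "", "Tuesday", "", ""),
  Class = c("", "day", "color", "", "", "color", "", "", "day", "", ""),
  stringsAsFactors = FALSE
)

# Rows that already carry a Name act as the lookup table.
lookup <- df[df$Name != "", ]

# For every row, find the position of its Code in the lookup table.
idx <- match(df$Code, lookup$Code)

filled <- data.frame(
  Code = df$Code,
  Name = lookup$Name[idx],
  Class = lookup$Class[idx],
  stringsAsFactors = FALSE
)
```

Unlike a join, match() always returns exactly one position per row, so the result keeps the original row order and row count.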
I'm new to R and would like to create a Gantt-style diagram where I can see how long jobs on a SQL Server run over the week.
So my y-axis would list the job names, and my x-axis would have a zoomable scale with weekdays, hours, minutes and seconds.
My dataset can still be reconfigured. I can transform the start and end times to any format, since I have them as DateTimes.
This is what the data looks like:
structure(list(JobName = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("ATLAS_Admin_DeployClientDatabase", "ATLAS_Admin_ParseMasterCubeDatabase"), class = "factor"), RunDateTime = structure(c(1L,3L, 5L, 2L, 4L), .Label = c("2016-11-10T15:39:36.0000000", "2016-11-16T11:30:20.0000000","2016-11-16T11:37:25.0000000", "2016-11-16T15:51:56.0000000","2016-11-16T15:52:59.0000000"), class = "factor"), StartWeekday = structure(c(1L,2L, 2L, 2L, 2L), .Label = c("Thursday", "Wednesday"), class = "factor"), StartTime = structure(c(3L, 2L, 5L, 1L, 4L), .Label = c("1899-12-30T11:30:20.0000000", "1899-12-30T11:37:25.0000000", "1899-12-30T15:39:36.0000000", "1899-12-30T15:51:56.0000000", "1899-12-30T15:52:59.0000000" ), class = "factor"), EndRunDateTime = structure(c(1L, 3L, 5L, 2L, 4L), .Label = c("2016-11-10T16:02:39.0000000", "2016-11-16T11:31:24.0000000", "2016-11-16T12:03:10.0000000", "2016-11-16T15:52:57.0000000", "2016-11-16T16:19:06.0000000"), class = "factor"), EndWeekday = structure(c(1L, 2L, 2L, 2L, 2L), .Label = c("Thursday", "Wednesday"), class = "factor"), EndTime = structure(c(4L, 2L, 5L, 1L, 3L), .Label = c("1899-12-30T11:31:24.0000000", "1899-12-30T12:03:10.0000000", "1899-12-30T15:52:57.0000000", "1899-12-30T16:02:39.0000000", "1899-12-30T16:19:06.0000000" ), class = "factor")), .Names = c("JobName", "RunDateTime","StartWeekday", "StartTime", "EndRunDateTime", "EndWeekday","EndTime"), row.names = c(NA, 5L), class = "data.frame")
The names are linked over the JobID.
In the end it should look like this: Gantt-Diagram with weekdays/times instead of dates
I'm not limited to any library, yet ggplot is already installed.
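A possible starting point, offered as a sketch rather than a finished Gantt chart: parse the start and end stamps into POSIXct, then draw one horizontal segment per run with ggplot2. It uses two runs copied from the data above, keeps the real dates on the x-axis (labelled by weekday and time) instead of folding all runs into a single week, and the helper parse_stamp is a made-up name.

```r
# Two runs copied from the question's data.
jobs <- data.frame(
  JobName = c("ATLAS_Admin_DeployClientDatabase",
              "ATLAS_Admin_ParseMasterCubeDatabase"),
  RunDateTime = c("2016-11-10T15:39:36.0000000", "2016-11-16T11:30:20.0000000"),
  EndRunDateTime = c("2016-11-10T16:02:39.0000000", "2016-11-16T11:31:24.0000000"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep "YYYY-MM-DDTHH:MM:SS" and drop the fractional seconds.
parse_stamp <- function(x) {
  as.POSIXct(substr(as.character(x), 1, 19),
             format = "%Y-%m-%dT%H:%M:%S", tz = "UTC")
}

jobs$Start <- parse_stamp(jobs$RunDateTime)
jobs$End <- parse_stamp(jobs$EndRunDateTime)
jobs$Minutes <- as.numeric(difftime(jobs$End, jobs$Start, units = "mins"))

# Build the plot only if ggplot2 is installed; call print(gantt) to draw it.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  gantt <- ggplot(jobs, aes(x = Start, xend = End, y = JobName, yend = JobName)) +
    geom_segment() +
    scale_x_datetime(date_labels = "%a %H:%M") +
    labs(x = "Start and end time", y = "Job")
}
```

as.character() inside the helper means it also works when the columns are factors, as they are in the dput output.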