How to reduce a data frame by grouping data?


:)
Is there an easy way to reduce a data set to a smaller data frame by grouping it on certain characteristics? I was thinking of writing an algorithm for this, but is there a function in R that already does it? I've tried using dplyr, but it didn't work very well.
E.g.:
P.S.: My data is in a matrix of more than 1 GB, so I need an automated process.
Example Data:
structure(list(Nun = 1:6, Event = c(1L, 1L, 1L, 1L, 2L, 2L),
Time = structure(c(3L, 4L, 5L, 6L, 1L, 2L), .Label = c("11:34",
"11:36", "8:50", "8:52", "8:54", "8:56"), class = "factor"),
User = structure(c(1L, 1L, 1L, 1L, 2L, 2L), .Label = c("U1",
"U7"), class = "factor")), .Names = c("Nun", "Event", "Time",
"User"), class = "data.frame", row.names = c(NA, -6L))

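You can use summarise from the dplyr package. Time must be a date-time (or numeric) column for the subtraction to work; max() and min() are not defined for factors:
library(dplyr)
your_data_frame %>%
  group_by(User, Event) %>%
  summarise(Duration = max(Time) - min(Time))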
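In the question's sample data, Time is a factor of "H:MM" labels, so it has to be converted before any subtraction. The following is a minimal base-R sketch of that step, assuming the labels are always hours and minutes within a single day. The Minutes column and the names df and res are introduced here for illustration only.

```r
# Sample data from the question, with Time stored as a factor
df <- data.frame(
  Nun   = 1:6,
  Event = c(1L, 1L, 1L, 1L, 2L, 2L),
  Time  = c("8:50", "8:52", "8:54", "8:56", "11:34", "11:36"),
  User  = c("U1", "U1", "U1", "U1", "U7", "U7"),
  stringsAsFactors = TRUE
)

# Convert the "H:MM" labels to minutes since midnight (assumes one day only)
parts <- strsplit(as.character(df$Time), ":", fixed = TRUE)
df$Minutes <- vapply(parts,
                     function(p) as.numeric(p[1]) * 60 + as.numeric(p[2]),
                     numeric(1))

# One row per User/Event pair, holding the span between first and last time
res <- aggregate(Minutes ~ User + Event, data = df,
                 FUN = function(m) max(m) - min(m))
names(res)[names(res) == "Minutes"] <- "Duration"
print(res)
```

For the six sample rows this gives a duration of 6 minutes for U1 (event 1) and 2 minutes for U7 (event 2).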

Here is the data.table way.
Example Data:
x<-structure(list(Nun = 1:6, Event = c(1L, 1L, 1L, 1L, 2L, 2L),
Time = structure(c(1508514600, 1508514720, 1508514840, 1508514960,
1508524440, 1508524560), class = c("POSIXct", "POSIXt"), tzone = ""),
User = structure(c(1L, 1L, 1L, 1L, 2L, 2L), .Label = c("U1",
"U7"), class = "factor")), .Names = c("Nun", "Event", "Time",
"User"), row.names = c(NA, -6L), class = "data.frame")
Code:
library(data.table)
setDT(x)
x[, list(Duration = max(Time) - min(Time)), by = list(Event, User)]
#    Event User Duration
# 1:     1   U1   6 mins
# 2:     2   U7   2 mins
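Subtracting two POSIXct values lets R choose the unit of the result automatically, so it may be safer to fix the unit explicitly when durations of very different sizes are expected. A small base-R sketch of this, using the first and last Event 1 timestamps from the example (the variable names are illustrative):

```r
# First and last timestamps of Event 1, as seconds since the epoch
t1 <- as.POSIXct(c(1508514600, 1508514960), origin = "1970-01-01", tz = "UTC")

auto  <- max(t1) - min(t1)                           # unit picked by R
fixed <- difftime(max(t1), min(t1), units = "mins")  # unit pinned to minutes

# Strip the difftime class to get a plain number of minutes
duration_mins <- as.numeric(fixed)
print(duration_mins)
```

The same difftime(..., units = "mins") expression could replace max(Time) - min(Time) inside the grouped call.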

Related

Read excel file in R: problem with columns' labels and data/hour format

I have an excel file like this:
which I tried to read by using:
library(xlsx)
df <- read.xlsx("2021.xlsx", sheetIndex = 1)
However, I obtained a result that I do not like very much
> dput(df)
structure(list(Twitter = structure(c(3L, 1L, 1L, 2L, 2L), .Label = c("Jack",
"John", "User"), class = "factor"), NA. = structure(c(5L, 1L,
3L, 4L, 2L), .Label = c("Hello world", "Hello!", "I'm a text",
"I'm an example", "Tweet"), class = "factor"), NA..1 = structure(c(3L,
1L, 1L, 2L, 2L), .Label = c("44293", "44294", "Date"), class = "factor"),
NA..2 = structure(c(3L, 1L, 1L, 2L, 2L), .Label = c("0.490277777777778",
"0.552083333333333", "Hour"), class = "factor"), NA..3 = structure(c(3L,
1L, 1L, 2L, 2L), .Label = c("3", "4", "x"), class = "factor"),
NA..4 = structure(c(3L, 2L, 2L, 1L, 1L), .Label = c("6",
"7", "y"), class = "factor"), NA..5 = structure(c(3L, 2L,
2L, 1L, 2L), .Label = c("no", "yes", "z"), class = "factor")), class = "data.frame", row.names =
c(NA, -5L))
i.e.,
> df
Twitter NA. NA..1 NA..2 NA..3 NA..4 NA..5
1 User Tweet Date Hour x y z
2 Jack Hello world 44293 0.490277777777778 3 7 yes
3 Jack I'm a text 44293 0.490277777777778 3 7 yes
4 John I'm an example 44294 0.552083333333333 4 6 no
5 John Hello! 44294 0.552083333333333 4 6 yes
This is not the desired result. First, the date and the hour are wrong. Second, the column labels are strange (Twitter, NA., NA..1 and so on). The correct labels are instead in the first row of the data frame. I would like to obtain labels like, e.g., the following:
Twitter.User, Twitter.Tweet, Twitter.Date, Twitter.Hour, Twitter.x, Twitter.y, Twitter.z

Try read.xlsx("2021.xlsx", sheetIndex = 1, startRow = 2)
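If the Date and Hour columns still arrive as Excel serial numbers, as in the output shown in the question, they can be converted by hand. This is a hedged base-R sketch, assuming the workbook uses Excel's default 1900 date system. The values are copied from the question's output, and the variable names are illustrative.

```r
# Values as they appear in the question's output
date_serial   <- c(44293, 44294)
hour_fraction <- c(0.490277777777778, 0.552083333333333)

# Excel day serials count from 1899-12-30 in the 1900 date system
dates <- as.Date(date_serial, origin = "1899-12-30")

# Times are stored as a fraction of a day; convert to whole minutes first
minutes <- as.integer(round(hour_fraction * 24 * 60))
hours   <- sprintf("%02d:%02d", minutes %/% 60L, minutes %% 60L)

print(data.frame(Date = dates, Hour = hours))
```

With these inputs the dates come out as 2021-04-07 and 2021-04-08, and the hours as 11:46 and 13:15.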

cut.default error in heatmap generation R

I want to generate a heatmap from a data frame with 6 rows and 8 columns. The last row of the data frame holds the information used to annotate the columns. The structure of the data frame is as follows:
heatmap_try <-structure(list(BGC0000041 = structure(c(1L, 2L, 1L, 1L, 1L, 3L
), .Label = c("0", "0.447458977", "a"), class = "factor"), BGC0000128 = structure(c(1L,
1L, 1L, 3L, 2L, 4L), .Label = c("0", "1.785875195", "4.093659107",
"a"), class = "factor"), BGC0000287 = structure(c(1L, 1L, 1L,
3L, 2L, 4L), .Label = c("0", "1.785875195", "4.456229186", "b"
), class = "factor"), BGC0000294 = structure(c(3L, 1L, 2L, 4L,
1L, 5L), .Label = c("0", "2.035046947", "3.230553742", "3.286304185",
"b"), class = "factor"), BGC0000295 = structure(c(1L, 1L, 1L,
2L, 1L, 3L), .Label = c("0", "2.286304185", "c"), class = "factor"),
BGC0000308 = structure(c(4L, 2L, 3L, 5L, 1L, 6L), .Label = c("6.277728291",
"6.313707588", "6.607936616", "6.622871165", "6.64385619",
"c"), class = "factor"), BGC0000323 = structure(c(1L, 2L,
1L, 1L, 1L, 3L), .Label = c("0", "0.447458977", "c"), class = "factor"),
BGC0000328 = structure(c(1L, 2L, 1L, 1L, 1L, 3L), .Label = c("0",
"0.447458977", "c"), class = "factor")), class = "data.frame", row.names = c("Gut",
"Oral", "Anterior_nares", "Retroauricular_crease", "Vagina",
"AL"))
My code for heatmap generation is as follows (I am using pheatmap library):
library(pheatmap)
heatmap_data1 <- heatmap_try[ c(1:5), c(1:8) ]
anotation_data <- as.data.frame(t(heatmap_try[6, ]))
row.names(anotation_data) <- colnames(heatmap_data1)
pheatmap(heatmap_data1, annotation_col = anotation_data, color = colorRampPalette(c("white","blue"))(n=100),cellwidth = 40,cellheight = 6,fontsize_row = 5,cluster_rows = F,cluster_cols = F)
However, I am getting the following error:
Error in cut.default(x, breaks = breaks, include.lowest = T) :
'x' must be numeric
What am I doing wrong?
Thanks!

This is because the columns of heatmap_data1 are factors; they need to be numeric. One way to convert them is:
heatmap_data1_num <- as.data.frame(lapply(heatmap_data1,
    function(x) as.numeric(as.character(x))),
    row.names = rownames(heatmap_data1))  # keep the original row labels
# then as before
pheatmap(heatmap_data1_num, annotation_col = anotation_data, color = colorRampPalette(c("white","blue"))(n=100),cellwidth = 40,cellheight = 6,fontsize_row = 5,cluster_rows = F,cluster_cols = F)
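The detour through as.character matters: calling as.numeric directly on a factor returns the internal level codes rather than the printed values. A short base-R sketch, using values from the first column of the example data (the variable names are illustrative):

```r
# A factor holding numbers as labels, as in the BGC0000041 column
f <- factor(c("0", "0.447458977", "0"))

codes  <- as.numeric(f)                # level codes, not the values
values <- as.numeric(as.character(f))  # the numbers that were printed

print(codes)
print(values)
```

Here codes is 1, 2, 1 while values is 0, 0.447458977, 0, which is what the heatmap needs.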

Melting and converting badly labeled likert Scale R

In my survey I made a mistake when coding a 5-point Likert scale, as follows:
dput(head(edu_data))
structure(list(Education.1. = structure(c(1L, 1L, 1L, 1L, 1L,
1L), .Label = c("", "Y"), class = "factor"), Education.2. = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = c("", "Y"), class = "factor"),
Education.3. = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("",
"Y"), class = "factor"), Education.4. = structure(c(1L, 1L,
1L, 2L, 2L, 1L), .Label = c("", "Y"), class = "factor"),
Education.5. = structure(c(2L, 2L, 2L, 1L, 1L, 1L), .Label = c("",
"Y"), class = "factor")), row.names = c(NA, 6L), class = "data.frame")
I would like to change this into one column with a single value such that
answer_to_ls = 1:5
The output I want is a single column holding one number per respondent, which means getting rid of the letter. I do, of course, have a unique respondent ID.
Please tell me if I can make my question clearer, as I want to be a valuable member of the community.

I think there are a lot of potential solutions available; try searching for merging or collapsing multiple binary or dichotomous columns into a single column. For example:
R - Convert various dummy/logical variables into a single categorical variable/factor from their name
In your case, you could try something like:
edu_data$answer_to_ls <- apply(edu_data[1:5] == "Y", 1, function(x) { if (any(x)) { as.numeric(gsub(".*(\\d+).", "\\1", names(which(x)))) } else NA })
This will extract the number from the column name for the Likert scale response 1 to 5, make it a numeric value, and include NA if there are no "Y" responses. edu_data[1:5] selects those columns to consider for conversion, in this case columns 1 through 5.
  Education.1. Education.2. Education.3. Education.4. Education.5. answer_to_ls
1                                                                Y            5
2                                                                Y            5
3                                                                Y            5
4                                                   Y                         4
5                                                   Y                         4
6                                                                            NA

d <- structure(list(Education.1. = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "Y"), class = "factor"),
Education.2. = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "Y"), class = "factor"),
Education.3. = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "Y"), class = "factor"),
Education.4. = structure(c(1L, 1L, 1L, 2L, 2L, 1L), .Label = c("", "Y"), class = "factor"),
Education.5. = structure(c(2L, 2L, 2L, 1L, 1L, 1L), .Label = c("", "Y"), class = "factor")),
row.names = c(NA, 6L), class = "data.frame")
# Use the full column names (with the trailing dot) rather than relying on partial matching
d$item1 <- 1 * (d$Education.1. == "Y") +
  2 * (d$Education.2. == "Y") +
  3 * (d$Education.3. == "Y") +
  4 * (d$Education.4. == "Y") +
  5 * (d$Education.5. == "Y")
print(d)
leads to
> print(d)
  Education.1. Education.2. Education.3. Education.4. Education.5. item1
1                                                                Y     5
2                                                                Y     5
3                                                                Y     5
4                                                   Y                  4
5                                                   Y                  4
6                                                                      0
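The two answers differ in how they treat a respondent with no "Y" at all (NA versus 0). The following base-R sketch combines the weighted-sum idea with an explicit NA for such rows. It assumes the columns are ordered from scale point 1 to 5; the data frame below mirrors the question's six rows, and the names yes and score are illustrative.

```r
# Six respondents, mirroring the question's sample: rows 1-3 ticked point 5,
# rows 4-5 ticked point 4, row 6 ticked nothing
d <- data.frame(
  Education.1. = rep("", 6),
  Education.2. = rep("", 6),
  Education.3. = rep("", 6),
  Education.4. = c("", "", "", "Y", "Y", ""),
  Education.5. = c("Y", "Y", "Y", "", "", ""),
  stringsAsFactors = TRUE
)

# Logical matrix: TRUE where a box was ticked
yes <- as.matrix(d[1:5]) == "Y"

# Weight each tick by its column position and sum across the row
score <- unname(rowSums(yes * col(yes)))

# Rows with zero or several ticks have no single valid answer
score[rowSums(yes) != 1] <- NA
d$answer_to_ls <- score
print(d$answer_to_ls)
```

Marking rows with more than one tick as NA is a design choice made here; the original answers do not address that case.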

Special characters in a column: mess in the table

I have a problem with special characters in a column of a table.
Here is an example of the data:
structure(list(shipType = structure(c(1L, 3L, 1L, 2L, 4L), .Label = c("CARGO",
"FISHING", "TOWING_LONG_WIDE", "UNKNOWN"), class = "factor"),
shipCargo = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "UNDEFINED", class = "factor"),
destination = structure(c(3L, 1L, 2L, 4L, 5L), .Label = c("\\KORSOR ;.,NA,.\\",
"LEHTMA", "RIGA", "TALLIN", "VYBORG"), class = "factor"),
eta = structure(c(1L, 2L, 5L, 3L, 4L), .Label = c("01/01 00:00 UTC",
"01/01 09:00 UTC", "24/12 16:00 UTC", "26/12 07:00 UTC",
"30/12 16:00 UTC"), class = "factor"), imo = structure(c(3L,
5L, 1L, 4L, 2L), .Label = c("7101891", "7406318", "9066045",
"9158185", "Russia"), class = "factor"), callsign = structure(c(5L,
1L, 2L, 3L, 4L), .Label = c("12", "UALB", "UBYK8", "UFPC",
"UICC"), class = "factor"), country = structure(c(2L, 1L,
2L, 2L, 2L), .Label = c("2014-12-29", "Russia"), class = "factor"),
month = c(12L, 1L, 12L, 12L, 12L), date = structure(c(2L,
1L, 2L, 2L, 2L), .Label = c("", "2014-12-29"), class = "factor"),
week = c(1L, NA, 1L, 1L, 1L), X = c(NA, NA, NA, NA, NA)), .Names = c("shipType",
"shipCargo", "destination", "eta", "imo", "callsign", "country",
"month", "date", "week", "X"), class = "data.frame", row.names = c(NA,
-5L))
As you can see in the second row, there is a problem in the "destination" column when reading the file with the following code:
data <- read.table(file, header=T, fill=T, sep=",")
I have tried different things, such as: exporting with quotes and without headers
data <- read.table(file, sep=",", fill=T, head=F, quote="")
and then removing the first line (the actual headers that are in the table...) and adding one more time these headers
data <- data[-1,]
colnames(data)<-c( "shipType", "shipCargo","destination","eta","imo","callsign", "country","month","date","week")
It looks better, but there are a lot of special characters, and editing them by hand would be time-consuming and error-prone (I have a lot of tables).
Is there a way to keep the columns from being scrambled when importing the file?
Thank you!

Comparing the position of 1's is matched in the strings in r

Suppose I am reading a .csv file into R whose columns contain strings of 0s and 1s. I need to compare the positions of the 1's in two columns, count 1 for each position that matches, and put that count in a third column.
Illustration:
dput(head(string_data))
structure(list(v_1 = structure(c(1L, 1L, 1L, 1L, 3L, 1L), .Label = c("",
"0,0,0,1", "0,0,1,0", "0,1,0,0", "1,1,0,0"), class = "factor"),
v_2 = structure(c(1L, 1L, 1L, 1L, 2L, 1L), .Label = c("",
"1,0,1,0"), class = "factor"), v_3 = structure(c(1L, 1L,
1L, 1L, 4L, 1L), .Label = c("", "0,0,0,1", "0,0,1,0", "1,0,0,0"
), class = "factor"), v_4 = structure(c(1L, 1L, 1L, 1L, 2L,
1L), .Label = c("", "0,0,0,1"), class = "factor"), v_5 = structure(c(1L,
5L, 1L, 1L, 1L, 2L), .Label = c("", "0,0,0,0,0", "0,0,0,1,0",
"0,0,1,0,0", "1,0,1,1,0"), class = "factor"), v_6 = structure(c(1L,
2L, 1L, 1L, 1L, 2L), .Label = c("", "1,0,1,1,0"), class = "factor"),
v_7 = structure(c(1L, 1L, 1L, 1L, 1L, 2L), .Label = c("",
"0,0,0,0", "0,0,0,1", "0,1,0,0", "1,0,0,0"), class = "factor"),
v_8 = structure(c(1L, 1L, 1L, 1L, 1L, 2L), .Label = c("",
"1,0,0,0"), class = "factor")), .Names = c("v_1", "v_2",
"v_3", "v_4", "v_5", "v_6", "v_7", "v_8"), row.names = c(NA,
6L), class = "data.frame")
Above I have pasted dput of head data.
I need to compare the positions of the 1's in column (2*i-1) with those in column (2*i) (i = 1, 2, ..., 8) and put the number of matches in a third column.
e.g.
Suppose I have a string 0,0,1,1 in first column and 0,1,1,1 in second column then in the third column it should return 2.
Can anyone please help me out with this one.
EDIT
The count in the third column should be based on the number of 1's in the second column's string. In the example above, the second column's string is 0,1,1,1, which implies that the count can vary from 0 to 3.

These two functions might be helpful as a starting point:
# Compares two strings and computes the number of '1's at matching positions
f <- function(s1, s2) {
  if (s1 == '' || s2 == '') return(0)
  m <- do.call(cbind, strsplit(c(s1, s2), ','))
  m2 <- rowMeans(m == "1")
  sum(m2 == 1.0)
}
# Calls `f()` for every row of two columns i and j from a data set d and returns a vector
# that could be used as a new column
f.cols <- function(d, i, j) {
  c1 <- as.character(d[, i])
  c2 <- as.character(d[, j])
  unname(mapply(f, c1, c2))
}
Example of use:
d$out <- f.cols(d, 1, 2)
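To cover every (2*i-1, 2*i) column pair rather than one pair, the same idea can be wrapped in a loop. The sketch below is a variant, not the answer's own code: it counts matches with a logical AND instead of rowMeans and assumes both strings in a pair have the same number of entries. The small data set and the names count_matches and match_i are hypothetical.

```r
# Count positions where both comma-separated strings hold a "1"
count_matches <- function(s1, s2) {
  if (s1 == "" || s2 == "") return(0)
  bits <- strsplit(c(s1, s2), ",", fixed = TRUE)
  sum(bits[[1]] == "1" & bits[[2]] == "1")
}

# The question's worked example: positions 3 and 4 match
example_count <- count_matches("0,0,1,1", "0,1,1,1")

# Hypothetical two-row data set with two column pairs
d <- data.frame(v_1 = c("0,0,1,0", ""), v_2 = c("1,0,1,0", ""),
                v_3 = c("1,0,0,0", ""), v_4 = c("0,0,0,1", ""),
                stringsAsFactors = TRUE)

# One new column per pair; new columns are appended after the originals,
# so the indices 2*i-1 and 2*i keep pointing at the input columns
for (i in seq_len(ncol(d) / 2)) {
  a <- as.character(d[[2 * i - 1]])
  b <- as.character(d[[2 * i]])
  d[[paste0("match_", i)]] <- unname(mapply(count_matches, a, b))
}
print(d[, c("match_1", "match_2")])
```

Empty strings count as zero matches, following the convention of f() above.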
