How to use table and aggregate in R

I have a data frame called tt. I would like to count how many variants of each type I have for each STATUS, using the aggregate and table functions. I tried aggregate(tt$STATUS, by = list(tt$variant), table), but it gives me weird column names that I couldn't understand. How do I do this properly?
tt <- structure(list(ID = structure(1:11, .Names = c("9", "10", "11",
"12", "13", "2280", "2415", "3096", "4095", "6437", "7642"), .Label = c("003-0029-0258443",
"003-0039-0349951", "003-0041-0357849", "003-0042-0388658", "010-0001-0040921",
"4_596_8", "5_26202_105", "64368", "A-ADC-AD002860", "MAP_64368",
"S0085"), class = "factor"), variant = structure(c(`9` = 1L,
`10` = 1L, `11` = 1L, `12` = 1L, `13` = 1L, `2280` = 2L, `2415` = 2L,
`3096` = 3L, `4095` = 2L, `6437` = 3L, `7642` = 3L), .Label = c("0/0",
"0/1", "1/1"), class = "factor"), STATUS = structure(c(`9` = 2L,
`10` = 2L, `11` = 2L, `12` = 2L, `13` = 2L, `2280` = 1L, `2415` = 1L,
`3096` = 1L, `4095` = 1L, `6437` = 1L, `7642` = 2L), .Label = c(" 1",
"-9"), class = "factor")), class = "data.frame", row.names = c("9",
"10", "11", "12", "13", "2280", "2415", "3096", "4095", "6437",
"7642"))

If we need table, we can apply it directly after subsetting the two columns, instead of calling it inside aggregate:
table(tt[c("STATUS", "variant")])
#       variant
# STATUS 0/0 0/1 1/1
#     1    0   3   2
#     -9   5   0   1
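If a long, one-row-per-combination layout is preferred (closer to what aggregate would return), the table can be converted with as.data.frame. The following is a minimal sketch; the small tt here is a simplified stand-in rebuilt from the question's values with plain strings, not the original factor columns.

```r
# Simplified stand-in for the question's data (plain character columns)
tt <- data.frame(
  variant = c(rep("0/0", 5), "0/1", "0/1", "1/1", "0/1", "1/1", "1/1"),
  STATUS  = c(rep("-9", 5), rep("1", 5), "-9")
)

# Cross-tabulate, then reshape to one row per STATUS/variant pair
counts <- as.data.frame(table(tt[c("STATUS", "variant")]))

# The count for each pair ends up in the Freq column
counts
```

Each row of counts holds one STATUS/variant combination, so the column names stay readable and the result can be filtered or merged like any other data frame.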

Related

Update row values based on condition in R

I am trying to update the values of columns C2 and C3 based on these conditions:
• The variable C2 is equal to 1 if the type of cue = 2, and 0 otherwise.
• The variable C3 is equal to 1 if the type of cue = 3, and 0 otherwise.
Data frame Image: https://drive.google.com/file/d/1Enik09cXQ21d3cQQv0_YQDZGBb3Btm5n/view?usp=sharing
dput(Cognitive[1:6,]) =
structure(list(Subject = c(1L, 1L, 1L, 1L, 1L, 1L), Time = c(191L,
206L, 219L, 176L, 182L, 196L), W = c(0L, 0L, 0L, 1L, 1L, 1L),
Cue = c(1L, 2L, 3L, 1L, 2L, 3L), D = c(0L, 0L, 0L, 0L, 0L,
0L), Subject.f = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12",
"13", "14", "15", "16", "17", "18", "19", "20", "21", "22",
"23", "24"), class = "factor"), Cue.f = structure(c(1L, 2L,
3L, 1L, 2L, 3L), .Label = c("1", "2", "3"), class = "factor"),
D.f = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("0",
"1"), class = "factor"), C2 = structure(c(1L, 2L, 3L, 1L,
2L, 3L), .Label = c("1", "2", "3"), class = "factor"), C3 = structure(c(1L,
2L, 3L, 1L, 2L, 3L), .Label = c("1", "2", "3"), class = "factor")), row.names = c(NA,
6L), class = "data.frame")
Cognitive <- read.csv(file = 'Cognitive.csv')
View(Cognitive)
# Factor the variables Subject, Cue and D and add these variable to the Cognitive data frame.
Cognitive <- mutate(Cognitive, Subject.f = factor(Subject), Cue.f = factor(Cue), D.f = factor(D))
Cognitive <- mutate(Cognitive, C2 = Cue.f, C3 = Cue.f)
Thanks.
library(dplyr)

# case_when() needs a comma between cases; TRUE is the fallback case
df %>%
  mutate(C2 = case_when(cue == 2 ~ 1,
                        TRUE ~ 0),
         C3 = case_when(cue == 3 ~ 1,
                        TRUE ~ 0))
A super easy base solution
df <- data.frame(cue=sample(c(1:3),10,replace = T),c2=sample(c(0,1),10,replace = T),c3=sample(c(0,1),10,replace = T))
df$c2 <- ifelse(df$cue==2,1,0)
df$c3 <- ifelse(df$cue==3,1,0)
EDIT: to add another dplyr solution
df <- dplyr::mutate(df,c2= ifelse(cue==2,1,0),c3= ifelse(cue==3,1,0))
We can use sapply in base R
df[-1] <- +(sapply(c(2, 3), `==`, df$cue))
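The answers above use a generic df with a lowercase cue column, while the question's data frame is Cognitive with a column named Cue. A minimal sketch of the same idea on those names follows; the six Cue values are copied from the dput in the question, and the rest of the data frame is left out for brevity.

```r
# Only the Cue column from the question's dput(Cognitive[1:6, ])
Cognitive <- data.frame(Cue = c(1L, 2L, 3L, 1L, 2L, 3L))

# A logical comparison coerced to integer gives the 0/1 indicator directly
Cognitive$C2 <- as.integer(Cognitive$Cue == 2)
Cognitive$C3 <- as.integer(Cognitive$Cue == 3)

Cognitive
```

Comparing against the integer Cue column rather than the factor Cue.f avoids any surprises from factor level codes.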

Read excel file in R: problem with columns' labels and data/hour format

I have an Excel file, which I tried to read using:
library(xlsx)
df <- read.xlsx("2021.xlsx", sheetIndex = 1)
However, I obtained a result that I do not like very much
> dput(df)
structure(list(Twitter = structure(c(3L, 1L, 1L, 2L, 2L), .Label = c("Jack",
"John", "User"), class = "factor"), NA. = structure(c(5L, 1L,
3L, 4L, 2L), .Label = c("Hello world", "Hello!", "I'm a text",
"I'm an example", "Tweet"), class = "factor"), NA..1 = structure(c(3L,
1L, 1L, 2L, 2L), .Label = c("44293", "44294", "Date"), class = "factor"),
NA..2 = structure(c(3L, 1L, 1L, 2L, 2L), .Label = c("0.490277777777778",
"0.552083333333333", "Hour"), class = "factor"), NA..3 = structure(c(3L,
1L, 1L, 2L, 2L), .Label = c("3", "4", "x"), class = "factor"),
NA..4 = structure(c(3L, 2L, 2L, 1L, 1L), .Label = c("6",
"7", "y"), class = "factor"), NA..5 = structure(c(3L, 2L,
2L, 1L, 2L), .Label = c("no", "yes", "z"), class = "factor")), class = "data.frame", row.names =
c(NA, -5L))
i.e.,
> df
  Twitter            NA. NA..1             NA..2 NA..3 NA..4 NA..5
1    User          Tweet  Date              Hour     x     y     z
2    Jack    Hello world 44293 0.490277777777778     3     7   yes
3    Jack     I'm a text 44293 0.490277777777778     3     7   yes
4    John I'm an example 44294 0.552083333333333     4     6    no
5    John         Hello! 44294 0.552083333333333     4     6   yes
This is not the desired result. First, the date and the hour are wrong. Second, the column labels are strange (Twitter, NA., NA..1 and so on). The correct labels are instead in the first row of the data frame. I would like to obtain labels like, e.g., the following:
Twitter.User, Twitter.Tweet, Twitter.Date, Twitter.Hour, Twitter.x, Twitter.y, Twitter.z
Try starting at the second row, so that it is used as the header: read.xlsx("2021.xlsx", sheetIndex = 1, startRow = 2)
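Skipping the first row addresses the labels. If the Date and Hour columns still come through as raw numbers like those in the dput above, they can be converted by hand. This is a sketch that assumes the workbook uses Excel's default 1900 date system (day counts from 1899-12-30) and that Hour is stored as a fraction of a day; the variable names are illustrative.

```r
# Values copied from the dput() output in the question
date_serial   <- c(44293, 44294)
hour_fraction <- c(0.490277777777778, 0.552083333333333)

# Assumption: Excel 1900 date system, i.e. serial numbers count days from 1899-12-30
dates <- as.Date(date_serial, origin = "1899-12-30")

# Fraction of a day -> whole minutes -> "HH:MM"
minutes <- as.integer(round(hour_fraction * 24 * 60))
hours   <- sprintf("%02d:%02d", minutes %/% 60L, minutes %% 60L)

data.frame(Date = dates, Hour = hours)
```

With these inputs the dates fall in April 2021, which is consistent with the file name 2021.xlsx.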

R Mlogit Error in replacement rows and impossible to coerce choice variable to logical

I want to run the mlogit function on my discrete choice dataset. Below I have provided the data and the lines of R code utilizing the package mlogit that are producing the errors. I am getting an error when running the full dataset about the dimensions for mlogit.data. When running mlogit, I am getting an error around coercing the "choice" variable to a logical.
Here is a snapshot of the code and the structure to reproduce a subset of it in R.
structure(list(row_id = c("1a", "1b", "1c", "1d", "1e", "1g"), choice = structure(c(2L, 1L, 2L, 1L, 2L, 1L), .Label = c("1", "2"), class = "factor"), alt_var = structure(c(1L, 2L, 1L, 2L, 1L, 2L), .Label = c("1", "2"), class = "factor"), meal_choice = structure(c(1L, 2L, 1L, 1L, 1L, 2L), .Label = c("1", "2"), class = "factor"), transport_choice = structure(c(1L, 2L, 1L, 2L, 2L, 2L), .Label = c("1", "2"), class = "factor"), packaging_source = structure(c(1L, 2L, 1L, 1L, 2L, 2L), .Label = c("1", "2"), class = "factor"), disposal_choice = structure(c(1L, 2L, 1L, 1L, 2L, 2L), .Label = c("1", "2"), class = "factor")), row.names = c(NA, -6L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x7fa6d980f8e0>)
mlogit.data(data = df, choice = "choice", shape = "long", alt.var = "alt_var", id.var = "row_id", drop.index = TRUE)
Error in `$<-.data.frame`(x, name, value) : replacement has 964 rows, data has 965
mlogit(choice ~ meal_choice + transport_choice + packaging_source + disposal_choice, df, reflevel = "car")
Error in dfidx::dfidx(data = data, dfa$idx, drop.index = dfa$drop.index, :
  impossible to coerce the choice variable to a logical

Fast, efficient way to loop over millions of rows and match columns

I'm working with eye tracking data right now, so I have a HUGE dataset (think millions of rows) and would like a fast way to do this task. Here's a simplified version of it.
The data tells you where the eye is looking at each time point, for each file we are looking at. X1, Y1 are the coordinates of the point we're looking at. There are multiple time points for each file (representing the eye looking at different locations in the file through time).
Filename Time X1 Y1
1 1 10 10
1 2 12 10
I also have a file of where items are located for each filename. Each file contains (in this simplified case) two objects. X1,Y1 are the lower left coordinates and X2, Y2 are the upper right. You can imagine this as giving the bounding box where the item is located in each file. E.g.
Filename Item X1 Y1 X2 Y2
1 Dog 11 10 20 20
What I'd like to do is add another column to the first data frame that tells me what object the person is looking at during each time for each file. If they are not looking at any of the objects, I'd like the column to say "none". Things on the border count as being looked at. E.g.
Filename Time X1 Y1 LookingAt
1 1 10 10 none
1 2 12 11 Dog
I know how to do this the for loop way, but it takes forever (and crashed my RStudio). I'm wondering if there might be a faster, more efficient way I'm missing.
Here's the dput for the first data frame (these contain more rows than the example I showed above):
structure(list(Filename = structure(c(1L, 1L, 1L, 2L, 2L, 3L,
3L, 3L, 3L), .Label = c("1", "2", "3"), class = "factor"), Time = structure(c(1L,
2L, 3L, 1L, 2L, 1L, 2L, 4L, 5L), .Label = c("1", "2", "3", "5",
"6"), class = "factor"), X1 = structure(c(1L, 4L, 3L, 2L, 1L,
4L, 6L, 5L, 1L), .Label = c("10", "11", "12", "15", "20", "25"
), class = "factor"), Y1 = structure(c(1L, 5L, 6L, 4L, 1L, 2L,
3L, 4L, 1L), .Label = c("10", "11", "12", "15", "20", "25"), class = "factor")), .Names = c("Filename",
"Time", "X1", "Y1"), row.names = c(NA, -9L), class = "data.frame")
And here's the dput for the second:
structure(list(Filename = structure(c(1L, 1L, 2L, 2L), .Label = c("1",
"3"), class = "factor"), Item = structure(1:4, .Label = c("Cat",
"Dog", "House", "Mouse"), class = "factor"), X1 = structure(c(2L,
4L, 3L, 1L), .Label = c("10", "11", "20", "35"), class = "factor"),
Y1 = structure(c(2L, 4L, 3L, 1L), .Label = c("10", "11",
"13", "35"), class = "factor"), X2 = structure(c(1L, 3L,
4L, 2L), .Label = c("10", "11", "20", "35"), class = "factor"),
Y2 = structure(c(1L, 3L, 4L, 2L), .Label = c("10", "11",
"13", "35"), class = "factor")), .Names = c("Filename", "Item",
"X1", "Y1", "X2", "Y2"), row.names = c(NA, -4L), class = "data.frame")
Using data.table and the sample data you provided, I would approach it as follows:
# getting the data in the right format
library(data.table)
datcols <- c("X","Y")
lucols <- c("X1","X2","Y1","Y2")
setDT(dat)[, (datcols) := lapply(.SD, function(x) as.numeric(as.character(x))), .SDcols = datcols
           ][, Filename := as.character(Filename)]
setDT(lu)[, (lucols) := lapply(.SD, function(x) as.numeric(as.character(x))), .SDcols = lucols
          ][, `:=` (Filename = as.character(Filename),
                    X1 = pmin(X1,X2), X2 = pmax(X1,X2), # make sure that 'X1' is always the lowest value
                    Y1 = pmin(Y1,Y2), Y2 = pmax(Y1,Y2))] # make sure that 'Y1' is always the lowest value
# matching the 'Items' to the correct rows
dat[, looked_at := lu$Item[Filename==lu$Filename &
                             between(X, lu$X1, lu$X2) &
                             between(Y, lu$Y1, lu$Y2)],
    by = .(Filename,Time)]
which gives:
> dat
   Filename Time  X  Y looked_at
1:        1    1 10 10       Cat
2:        1    2 15 20        NA
3:        1    3 12 25        NA
4:        2    1 11 15        NA
5:        2    2 10 10        NA
6:        3    1 15 11        NA
7:        3    2 25 12        NA
8:        3    5 20 15     House
9:        3    6 10 10     Mouse
Used data:
dat <- structure(list(Filename = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L), .Label = c("1", "2", "3"), class = "factor"),
Time = structure(c(1L, 2L, 3L, 1L, 2L, 1L, 2L, 4L, 5L), .Label = c("1", "2", "3", "5", "6"), class = "factor"),
X = structure(c(1L, 4L, 3L, 2L, 1L, 4L, 6L, 5L, 1L), .Label = c("10", "11", "12", "15", "20", "25"), class = "factor"),
Y = structure(c(1L, 5L, 6L, 4L, 1L, 2L, 3L, 4L, 1L), .Label = c("10", "11", "12", "15", "20", "25"), class = "factor")),
.Names = c("Filename", "Time", "X", "Y"), row.names = c(NA, -9L), class = "data.frame")
lu <- structure(list(Filename = structure(c(1L, 1L, 2L, 2L), .Label = c("1", "3"), class = "factor"),
Item = structure(1:4, .Label = c("Cat", "Dog", "House", "Mouse"), class = "factor"),
X1 = structure(c(2L, 4L, 3L, 1L), .Label = c("10", "11", "20", "35"), class = "factor"),
X2 = structure(c(1L, 3L, 4L, 2L), .Label = c("10", "11", "20", "35"), class = "factor"),
Y1 = structure(c(2L, 4L, 3L, 1L), .Label = c("10", "11", "13", "35"), class = "factor"),
Y2 = structure(c(1L, 3L, 4L, 2L), .Label = c("10", "11", "13", "35"), class = "factor")),
.Names = c("Filename", "Item", "X1", "X2", "Y1", "Y2"), row.names = c(NA, -4L), class = "data.frame")
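For millions of rows, grouping by every Filename/Time pair may itself become slow. One alternative worth trying is a data.table non-equi join, which matches each gaze point to a containing bounding box in one pass and also makes it easy to fill in "none" as the question asked. The sketch below is not the answer's method; gaze and boxes are small illustrative tables with numeric columns, not the question's data.

```r
library(data.table)

# Illustrative stand-ins: gaze points and bounding boxes, already numeric
gaze <- data.table(Filename = c("1", "1", "3"),
                   Time = c(1, 2, 1),
                   X = c(10, 12, 50),
                   Y = c(10, 11, 50))
boxes <- data.table(Filename = c("1", "3"),
                    Item = c("Dog", "House"),
                    X1 = c(11, 20), Y1 = c(10, 13),
                    X2 = c(20, 35), Y2 = c(20, 35))

# Non-equi join: same file, and the point lies inside the box (borders included).
# mult = "first" keeps one match per gaze row; unmatched rows come back as NA.
gaze[, LookingAt := boxes[gaze, x.Item,
                          on = .(Filename, X1 <= X, X2 >= X, Y1 <= Y, Y2 >= Y),
                          mult = "first"]]

# Replace the NA left by unmatched rows with the label requested in the question
gaze[is.na(LookingAt), LookingAt := "none"]
gaze
```

If boxes can overlap, mult = "first" silently picks one of them, so that case would need its own rule.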

How to generate a sequence based on two columns in R?

Below you can recreate my data in R. I would like to generate a sequence of numbers based on two individual columns. In this example of real data my column names are :
df= or10x1BC
"Tank" "Core" "BCl" "BCu" "Mid" "TL" "SL"
I wish to use the values in each row of BCu and BCl to generate a sequence with a step of 0.001. For example, seq(BCu[1], BCl[1], 0.001) will generate a sequence based on the first row of each; I wish to have this work for each row down the list.
Ultimately this sequence will be used in my function to make an average of the sequence, i.e. mean(function(seq(BCu[i], BCl[j], 0.001))), and be added to a new column: or10x1BC["meanBVF"] = mean(function(seq(BCu[i], BCl[j], 0.001))).
See data below:
structure(list(Tank = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L), .Label = "1", class = "factor"), Core = structure(c(1L,
1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor"),
BCl = structure(c(8L, 5L, 2L, 6L, 3L, 1L, 9L, 7L, 4L), .Label = c("17",
"18", "22", "22.3", "23", "26", "27.3", "28", "29"), class = "factor"),
BCu = structure(c(8L, 5L, 2L, 6L, 3L, 1L, 9L, 7L, 4L), .Label = c("12.5",
"13.5", "17", "17.8", "18", "22", "22.3", "23", "27.3"), class = "factor"),
Mid = structure(c(8L, 5L, 2L, 6L, 3L, 1L, 9L, 7L, 4L), .Label = c("14.75",
"15.75", "19.5", "20.05", "20.5", "24", "24.8", "25.5", "28.15"
), class = "factor"), TL = structure(c(2L, 2L, 2L, 1L, 1L,
1L, 3L, 3L, 3L), .Label = c("26", "28", "29"), class = "factor"),
SL = structure(c(4L, 4L, 3L, 2L, 4L, 3L, 1L, 4L, 3L), .Label = c("1.7",
"4", "4.5", "5"), class = "factor")), .Names = c("Tank",
"Core", "BCl", "BCu", "Mid", "TL", "SL"), row.names = c(NA, -9L
), class = "data.frame")
mapply is like apply or lapply, but it iterates over multiple arguments in parallel.
First, as I mentioned in the comment, we need to convert your data to numeric. I did it like this, to convert all but the second column:
df[, -2] = lapply(df[, -2], as.character)
df[, -2] = lapply(df[, -2], as.numeric)
We can then use mapply like this to generate the sequences:
seqs = mapply(FUN = function(a, b) {
seq(from = a, to = b, by = .001)
}, a = df$BCu, b = df$BCl)
It seems messy to put that in the data frame, but you can if you'd like:
df$seqs = seqs
If it were me, I'd probably leave it as a list of vectors outside of the data frame.
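To get the per-row average the question asks for, the list of sequences can be reduced to one number per row. The sketch below uses the first three BCu/BCl pairs from the question's data and plain mean as a placeholder for the asker's own function, which is not shown in the question.

```r
# First three rows of BCu and BCl from the question, already numeric
df <- data.frame(BCu = c(23, 18, 13.5), BCl = c(28, 23, 18))

# SIMPLIFY = FALSE always returns a list, even if all sequences have equal length
seqs <- mapply(function(a, b) seq(from = a, to = b, by = 0.001),
               df$BCu, df$BCl, SIMPLIFY = FALSE)

# One mean per row; mean is a placeholder for the real per-sequence function
df$meanBVF <- vapply(seqs, mean, numeric(1))
df
```

Storing only the summary value keeps the data frame flat, while seqs stays available as a separate list if the full sequences are needed later.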
