Pulling columns based on values in a row - r

I am looking for a way to use the values in the first row of a data frame to filter its columns. In the first row we have -0.5, 0.7, 1.1, and -1.2. I want to keep only the columns whose first-row value is greater than or equal to 1, or less than or equal to -1.2; everything else will be dropped.
Say the original data I have is DF1:
ID      Location   XPL  SNA  AAS   APA
First   Park      -0.5  0.7  1.1  -1.2
Second  School     2    5    2     3
Second  Home       4    5    6     4
Third   Car        1    8    8     5
Third   Lake       7    5    4     6
Fourth  Prison     4    5    1     7
With the filter, I would now have a new DF:
ID      Location  AAS   APA
First   Park      1.1  -1.2
Second  School    2     3
Second  Home      6     4
Third   Car       8     5
Third   Lake      4     6
Fourth  Prison    1     7
What would be the best way to do this? I feel there must be a way to select columns based on values from a row, but I can't work out the right commands.
ID <- c("First", "Second", "Second", "Third", "Third", "Fourth")
Location <- c("Park", "School", "Home", "Car", "Lake", "Prison")
XPL <- c(-0.5,2,4,1,7,4)
SNA <- c(0.7,5,5,8,5,5)
AAS <- c(1.1,2,6,8,4,1)
APA <- c(-1.2,3,4,5,6,7)
DF1 <- data.frame(ID, Location, XPL, SNA, AAS,APA)

In dplyr, you can deselect the numeric columns whose first value has an absolute value of at most 1:
library(dplyr)
DF1 %>%
  select(!where(~ is.numeric(.x) && abs(first(.x)) <= 1))
#       ID Location AAS  APA
# 1  First     Park 1.1 -1.2
# 2 Second   School 2.0  3.0
# 3 Second     Home 6.0  4.0
# 4  Third      Car 8.0  5.0
# 5  Third     Lake 4.0  6.0
# 6 Fourth   Prison 1.0  7.0
Or with between() (which is inclusive on both ends, hence the slightly offset bounds):
DF1 %>%
  select(!where(~ is.numeric(.x) && between(first(.x), -1.19, 0.99)))

If you are using the first row as the basis, you can convert it to a plain numeric vector and use the which() function to get the indices of the columns to keep:
test.row <- as.numeric(DF1[1, 3:6])  # columns 3 to 6 span XPL through APA
DF1 <- DF1[, c(1:2, 2 + which(test.row >= 1 | test.row <= -1.2))]
The first two columns (ID and Location) are always kept, and the which() result is offset by 2 to account for them.
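A slightly more general base-R variant (a sketch, not from the answers above) avoids hard-coding the numeric column range by testing every column: non-numeric columns are always kept, and numeric columns are kept only when their first value passes the test.

```r
# example data from the question
ID <- c("First", "Second", "Second", "Third", "Third", "Fourth")
Location <- c("Park", "School", "Home", "Car", "Lake", "Prison")
XPL <- c(-0.5, 2, 4, 1, 7, 4)
SNA <- c(0.7, 5, 5, 8, 5, 5)
AAS <- c(1.1, 2, 6, 8, 4, 1)
APA <- c(-1.2, 3, 4, 5, 6, 7)
DF1 <- data.frame(ID, Location, XPL, SNA, AAS, APA)

# keep non-numeric columns, plus numeric columns whose first value
# is >= 1 or <= -1.2
keep <- vapply(DF1, function(col) {
  !is.numeric(col) || col[1] >= 1 || col[1] <= -1.2
}, logical(1))
DF2 <- DF1[, keep]
names(DF2)
# -> "ID" "Location" "AAS" "APA"
```

This way the filter still works if columns are added or reordered, since nothing depends on positions 3:6.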

Related

Getting name based on max value of a column with NA removed in one liner

I'm trying to get the name of the animal that has the maximum value of REM sleep. This is what I'm doing right now, but I'm hoping for a better way that returns the exact value.
msleep = ggplot2::msleep
msleep[order(msleep$sleep_rem, na.last=TRUE, decreasing=TRUE), ]
The above returns me the sorted data but it's hard to see in console in Rstudio. Is there a better way to do this?
We can use which.max to get the index of the maximum value in 'sleep_rem' and use that to subset 'name':
msleep$name[which.max(msleep$sleep_rem)]
#[1] "Thick-tailed opposum"
For better viewing, you can arrange the data and select only the columns of interest:
library(dplyr)
msleep %>% arrange(desc(sleep_rem)) %>% select(name, sleep_rem)
# A tibble: 83 x 2
#    name                           sleep_rem
#    <chr>                              <dbl>
#  1 Thick-tailed opposum                 6.6
#  2 Giant armadillo                      6.1
#  3 North American Opossum               4.9
#  4 Big brown bat                        3.9
#  5 European hedgehog                    3.5
#  6 Thirteen-lined ground squirrel       3.4
#  7 Domestic cat                         3.2
#  8 Long-nosed armadillo                 3.1
#  9 Golden hamster                       3.1
# 10 Golden-mantled ground squirrel       3
# … with 73 more rows
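One caveat worth knowing: which.max() silently skips NAs and returns only the first index when there are ties. A small base-R sketch on a toy data frame (made-up animals and values, not the real msleep data) showing how to recover every name tied for the maximum:

```r
# toy stand-in for msleep, with an NA and a tie (values are made up)
animals <- data.frame(
  name      = c("opossum", "armadillo", "bat", "cat"),
  sleep_rem = c(6.6, 6.6, 3.9, NA)
)

# which.max() ignores the NA but reports only the first maximum
animals$name[which.max(animals$sleep_rem)]
# -> "opossum"

# to get every name tied for the maximum:
top <- animals$name[which(animals$sleep_rem == max(animals$sleep_rem, na.rm = TRUE))]
top
# -> "opossum" "armadillo"
```

which() drops the NA produced by the comparison, so no explicit is.na() handling is needed here.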

Assigning elements of one vector to elements of another with R

I would like to assign elements of one vector to elements of another for every single user.
For example:
Within a data frame with the variables "user", "activities" and "minutes" (see below), I would like to assign the duration of user 1's first activity (4 minutes, "READ") to a new variable READ_duration, the duration of the second activity (5 minutes, "EDIT") to a new variable EDIT_duration, and the duration of the third activity (2 minutes, again "READ") back onto READ_duration.
user <- c(1, 2, 3)
activities <- list(c("READ", "EDIT", "READ"), c("READ", "EDIT", "WRITE"), c("WRITE", "EDIT"))
minutes <- list(c(4, 5, 2), c(3.5, 1, 2), c(4.5, 3))
Output should be like: in a data frame with the assigned minutes to the activities:
user READ_duration EDIT_duration WRITE_duration
   1             6             5              0
   2           3.5             1              2
   3             0             3            4.5
The tricky thing here is that the algorithm needs to consider that the activities are not in the same order for every user. For example, user 3 starts with writing, so the duration 4.5 needs to be assigned to column 4, WRITE_duration.
Also, the solution needs to scale to a massive number of users, so it must be automated rather than done by hand.
Thank you so much for your help!!
This needs a simple reshape to wide format with sum as an aggregation function.
Prepare a long-format data.frame:
user <- c(1,2,3)
activities <- list(c("READ","EDIT","READ"), c("READ","EDIT", "WRITE"), c("WRITE","EDIT"))
minutes <- list(c(4,5,2), c(3.5, 1, 2), c(4.5,3))
DF <- Map(data.frame, user = user, activities = activities, minutes = minutes)
DF <- do.call(rbind, DF)
#   user activities minutes
# 1    1       READ     4.0
# 2    1       EDIT     5.0
# 3    1       READ     2.0
# 4    2       READ     3.5
# 5    2       EDIT     1.0
# 6    2      WRITE     2.0
# 7    3      WRITE     4.5
# 8    3       EDIT     3.0
Reshape:
library(reshape2)
dcast(DF, user ~ activities, value.var = "minutes", fun.aggregate = sum)
#   user EDIT READ WRITE
# 1    1    5  6.0   0.0
# 2    2    1  3.5   2.0
# 3    3    3  0.0   4.5
Or in base R you could do:
xtabs(min ~ ind + values, cbind(stack(setNames(activities, user)), min = unlist(minutes)))
#    values
# ind EDIT READ WRITE
#   1  5.0  6.0   0.0
#   2  1.0  3.5   2.0
#   3  3.0  0.0   4.5
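Another base-R route (a sketch, assuming R >= 3.4 for the `default` argument) is tapply(), which aggregates minutes per (user, activity) cell and fills the missing combinations with 0, giving the same wide layout as a matrix:

```r
# rebuild the long-format data from the answer above
user <- c(1, 2, 3)
activities <- list(c("READ", "EDIT", "READ"), c("READ", "EDIT", "WRITE"), c("WRITE", "EDIT"))
minutes <- list(c(4, 5, 2), c(3.5, 1, 2), c(4.5, 3))
DF <- do.call(rbind, Map(data.frame, user = user, activities = activities, minutes = minutes))

# sum minutes per (user, activity); absent combinations become 0
wide <- tapply(DF$minutes, list(user = DF$user, activity = DF$activities),
               sum, default = 0)
```

The result is a user-by-activity numeric matrix, e.g. wide["1", "READ"] is 6 and wide["1", "WRITE"] is 0; wrap it in as.data.frame() if you need a data frame.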

R - Create random subsamples of size n for multiple sample groups

I have a large data set of samples that belong to different groups and differ in the area covered. The structure of the data set is simplified below. I now would like to create pooled samples (Subgroups) for each Group where the area covered by each Subgroup equates to a specified area (e.g. 20). Samples should be allocated randomly and without replacement to each Subgroup and the number of the Subgroup should be listed in a new column at the end of the data frame.
SampleID Group Area Subgroup
1 A 1.5 1
2 A 3.8 2
3 A 6 4
4 A 1.9 1
5 A 1.5 3
6 A 4.1 1
7 A 3.7 1
8 A 4.5 3
...
300 B 1.2 1
301 B 3.8 1
302 B 4.1 4
303 B 2.6 3
304 B 3.1 5
305 B 3.5 3
306 B 2.1 2
...
2000 S 2.7 5
...
I am currently using the ‘cumsum’ command to create the Subgroups, using the code below.
dat <- read.table("Pooling_Test.txt", header = TRUE, sep = "\t")
dat$CumArea <- cumsum(dat$Area)
dat$Diff_CumArea <- c(0, head(cumsum(dat$Area), -1))
dat$Sample_Int_1 <- 0   # numeric flags, so cumsum() below works
dat$Sample_End <- "0"
current.sum <- 0
for (c in 1:nrow(dat)) {
  current.sum <- current.sum + dat[c, "Area"]
  dat[c, "Diff_CumArea"] <- current.sum
  if (current.sum >= 20) {
    dat[c, "Sample_Int_1"] <- 1
    dat[c, "Sample_End"] <- "End"
    current.sum <- 0
    dat$Sample_Int_2 <- cumsum(dat$Sample_Int_1) + 1
    dat$Sample_Final <- dat$Sample_Int_2
    for (d in 1:nrow(dat)) {
      if (dat$Sample_End[d] == "End")
        dat$Subgroup[d] <- dat$Sample_Int_2[d] - 1
      else 0
    }
  }
}
write.csv(dat, file = "Pooling_Test_Output.csv", row.names = FALSE)
The resultant data frame shows what I want (see below). However, there are a couple of steps I would like to improve. First, I have had problems including a command for choosing samples randomly within each Group, so I currently randomise the order of samples before loading the data frame into R. Second, in the output table the Subgroups are numbered consecutively across Groups, but I would like the numbering to restart at 1 for each new Group. Does anybody have advice on how to achieve this?
SampleID Group CumArea Subgroups
1 A 1.5 1
77 A 4.6 1
6 A 9.3 1
43 A 16.4 1
17 A 19.5 1
67 A 2.1 2
4 A 4.3 2
32 A 8.9 2
...
300 B 4.5 10
257 B 6.8 10
397 B 10.6 10
344 B 14.5 10
367 B 16.7 10
303 B 20.1 10
306 B 1.5 11
...
A few functions in the dplyr package make this fairly straightforward. You can use slice to randomize the data, group_by to perform computations at the group level, and mutate to create new variables. If you chain the functions together with the %>% operator, I believe the solution would look something like this, assuming that you want groups that add up to 20.
install.packages("dplyr") #If you haven't used dplyr before
library(dplyr)
dat %>%
  group_by(Group) %>%
  slice(sample(1:n())) %>%
  mutate(CumArea = cumsum(Area), SubGroup = ceiling(CumArea / 20))
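The same idea can be sketched in base R with ave(), which restarts the cumulative area (and hence the subgroup numbering) within each Group. The data frame here is made up for illustration, standing in for Pooling_Test.txt, and the ceiling(CumArea / 20) grouping rule mirrors the one above:

```r
set.seed(42)  # reproducible shuffling

# made-up example data standing in for Pooling_Test.txt
dat <- data.frame(
  SampleID = 1:12,
  Group    = rep(c("A", "B"), each = 6),
  Area     = c(1.5, 3.8, 6, 1.9, 11.5, 4.1, 1.2, 3.8, 14.1, 2.6, 3.1, 3.5)
)

# shuffle rows within each Group (random allocation without replacement)
dat <- dat[order(dat$Group, sample(nrow(dat))), ]

# cumulative area restarts per Group; subgroup numbering restarts at 1
dat$CumArea  <- ave(dat$Area, dat$Group, FUN = cumsum)
dat$Subgroup <- ceiling(dat$CumArea / 20)
```

Because ave() applies cumsum() separately per Group, the first sample of every Group always lands in Subgroup 1, which addresses the restart-numbering problem.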

Is there a package that I can use in order to get rules for a target outcome in R

For example, in the data set below I would like to get the ranges of each variable that yield a pre-set value of "percentage": say I need the value of "percentage" to be >= 0.7; in that case the outcome should be something like:
birds >= 5, 1 < wolfs <= 3, 2 <= snakes <= 4
Example data set:
dat <- read.table(text = "birds wolfs snakes percentage
3 8 7 0.50
1 2 3 0.33
5 1 1 0.66
6 3 2 0.80
5 2 4 0.74", header = TRUE)
I can't use decision trees, as I have a large data frame and can't view the whole tree properly. I tried the *arules* package, but it requires all variables to be factors, whereas I have a mixed data set of factor, logical and continuous variables, and I would like to keep the independent variables continuous. Also, "percentage" is the only variable I want to optimize.
The code that I wrote with the *arules* package is this:
library(arules)
dat$birds<-as.factor(dat$birds)
dat$wolfs<-as.factor(dat$wolfs)
dat$snakes<-as.factor(dat$snakes)
dat$percentage<-as.factor(dat$percentage)
rules<-apriori(dat, parameter = list(minlen=2, supp=0.005, conf=0.8))
Thank you
I may have misunderstood the question but to get the maximum value of each variable with the restriction of percentage >= 0.7 you could do this:
lapply(dat[dat$percentage >= 0.7, 1:3], max)
$birds
[1] 6
$wolfs
[1] 3
$snakes
[1] 4
Edit after comment:
So perhaps this is more what you are looking for:
> as.data.frame(lapply(dat[dat$percentage >= 0.7,1:3], function(y) c(min(y), max(y))))
birds wolfs snakes
1 5 2 2
2 6 3 4
This gives the min and max values, i.e. the ranges of the variables over the rows where percentage >= 0.7.
If this is completely missing what you are trying to achieve, I may not be the right person to help you.
Edit #2:
> as.data.frame(lapply(dat[dat$percentage >= 0.7, 1:3], function(y) c(min(y), max(y), length(y), length(y)/nrow(dat))))
  birds wolfs snakes
1   5.0   2.0    2.0
2   6.0   3.0    4.0
3   2.0   2.0    2.0
4   0.4   0.4    0.4
Row 1: min
Row 2: max
Row 3: number of observations meeting the condition
Row 4: percentage of observations meeting the condition (relative to total observations)
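Building on the approach above, the summary can be wrapped in a small helper with labelled rows so the output is self-describing. The function name rule_ranges is hypothetical (my own label, not from any package):

```r
dat <- read.table(text = "birds wolfs snakes percentage
3 8 7 0.50
1 2 3 0.33
5 1 1 0.66
6 3 2 0.80
5 2 4 0.74", header = TRUE)

# summarise each predictor over the rows meeting the target condition;
# 'share' is relative to the full data set, as in Edit #2 above
rule_ranges <- function(data, condition, cols) {
  sub <- data[condition, cols, drop = FALSE]
  out <- as.data.frame(lapply(sub, function(y)
    c(min(y), max(y), length(y), length(y) / nrow(data))))
  rownames(out) <- c("min", "max", "n", "share")
  out
}

rule_ranges(dat, dat$percentage >= 0.7, c("birds", "wolfs", "snakes"))
#       birds wolfs snakes
# min     5.0   2.0    2.0
# max     6.0   3.0    4.0
# n       2.0   2.0    2.0
# share   0.4   0.4    0.4
```

The row labels replace the "Row 1: min, Row 2: max, ..." legend, so the result can be read (or written to CSV) without extra explanation.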

R: repeat series of numbers within groups a number of times that differs among groups

I have a data frame that looks something like the one below, which I'll call data frame 1. There is no regular pattern to the number of rows associated with each number in the “tank” column (or the other columns for that matter).
#code for making data frame 1
tank<-c(1,1,2,3,3,3,4,4)
size<-c(2.1,3.5,2.3,4.0,3.3,2.2,1.9,3.0)
mass<-c(6.5,5.5,5.9,7.2,4.9,8.0,9.1,6.3)
df1<-data.frame(cbind(tank,size,mass))
I need to repeat the sequence of values found in the "size" and "mass" columns within each tank. However, the number of repeats for each tank's sequence will differ (again in no particular pattern). I have another data frame (data frame 2) that contains the number of repeats for each tank's sequence, and it looks something like this:
#code for making data frame 2
tank<-c(1,2,3,4)
rpeat<-c(3,1,2,2)
df2<-data.frame(cbind(tank,rpeat))
Ultimately, my goal is to have a data frame like this (see below). Each series of values within a tank is repeated a number of times equal to that specified in data frame 2.
#code for making data frame 3
tank<-c(1,1,1,1,1,1,2,3,3,3,3,3,3,4,4,4,4)
size<-c(2.1,3.5,2.1,3.5,2.1,3.5,2.3,4.0,3.3,2.2,4.0,3.3,2.2,1.9,3.0,1.9,3.0)
mass<-c(6.5,5.5,6.5,5.5,6.5,5.5,5.9,7.2,4.9,8.0,7.2,4.9,8.0,9.1,6.3,9.1,6.3)
df3<-data.frame(cbind(tank,size,mass))
I have figured out a somewhat crude way to do this when each number in the size and mass columns is just repeated a specified number of times (see below) but not how to create the repeating series that I need.
#code to make data frame 4
tank<-c(1,1,1,1,1,1,2,3,3,3,3,3,3,4,4,4,4)
size2<-c(2.1,2.1,2.1,3.5,3.5,3.5,2.3,4.0,4.0,3.3,3.3,2.2,2.2,1.9,1.9,3.0,3.0)
mass2<-c(6.5,6.5,6.5,5.5,5.5,5.5,5.9,7.2,7.2,4.9,4.9,8.0,8.0,9.1,9.1,6.3,6.3)
df4<-data.frame(cbind(tank,size2,mass2))
To make the above data frame, I took the data frame below, which combines data frames 1 and 2, and applied the code below.
#code to produce data frame 5
tank<-c(1,1,2,3,3,3,4,4)
size<-c(2.1,3.5,2.3,4.0,3.3,2.2,1.9,3.0)
mass<-c(6.5,5.5,5.9,7.2,4.9,8.0,9.1,6.3)
rpeat<-c(3,3,1,2,2,2,2,2)
df5<-data.frame(cbind(tank,size,mass,rpeat))
#code to produce data frame 4 from data frame 5
tank_col <- rep(df5$tank, times = df5$rpeat)
size_col <- rep(df5$size, times = df5$rpeat)
mass_col <- rep(df5$mass, times = df5$rpeat)
goal <-data.frame(cbind(tank_col,size_col,mass_col))
Sorry this is so long, but I have a hard time explaining what I need to do without providing examples. Thanks in advance for any help you can provide.
You can use data.table and a keyed join:
library(data.table)
# create df1 and df2 as data.tables keyed by tank
DT1 <- data.table(df1, key = 'tank')
DT2 <- data.table(df2, key = 'tank')
# you can now join on tank, and repeat all columns in
# .SD (the subset of the data.table)
DT1[DT2, lapply(.SD, rep, times = rpeat)]
#     tank size mass
#  1:    1  2.1  6.5
#  2:    1  3.5  5.5
#  3:    1  2.1  6.5
#  4:    1  3.5  5.5
#  5:    1  2.1  6.5
#  6:    1  3.5  5.5
#  7:    2  2.3  5.9
#  8:    3  4.0  7.2
#  9:    3  3.3  4.9
# 10:    3  2.2  8.0
# 11:    3  4.0  7.2
# 12:    3  3.3  4.9
# 13:    3  2.2  8.0
# 14:    4  1.9  9.1
# 15:    4  3.0  6.3
# 16:    4  1.9  9.1
# 17:    4  3.0  6.3
Read the vignettes associated with data.table to get a full understanding of what is going on.
What we are doing is called by-without-by within the vignettes.
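If you prefer to avoid the data.table dependency, the same repeat-the-whole-sequence behaviour can be sketched in base R by splitting df1 by tank and repeating each block's row indices (this assumes df2 is sorted by tank, as in the example, so the blocks line up with df2$rpeat positionally):

```r
# data from the question
df1 <- data.frame(tank = c(1, 1, 2, 3, 3, 3, 4, 4),
                  size = c(2.1, 3.5, 2.3, 4.0, 3.3, 2.2, 1.9, 3.0),
                  mass = c(6.5, 5.5, 5.9, 7.2, 4.9, 8.0, 9.1, 6.3))
df2 <- data.frame(tank = c(1, 2, 3, 4), rpeat = c(3, 1, 2, 2))

# rep(1:2, times = 3) gives 1 2 1 2 1 2, i.e. the whole sequence repeats
# (as opposed to rep(..., each = 3), which would repeat element-wise)
blocks <- split(df1, df1$tank)
df3 <- do.call(rbind, Map(function(block, n) {
  block[rep(seq_len(nrow(block)), times = n), ]
}, blocks, df2$rpeat))
rownames(df3) <- NULL
```

This produces the 17-row data frame 3 from the question; the `times =` versus `each =` distinction in rep() is exactly the difference between data frame 3 and data frame 4.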