Least Absolute Deviation in R [duplicate]

This question already has answers here:
Grouping functions (tapply, by, aggregate) and the *apply family
(10 answers)
Closed 4 years ago.
I have a LIST of dataframes. Each dataframe has the same number of rows and columns.
Here is a sample dataframe:
df
TIME AMOUNT
20 456
30 345
15 122
12 267
Here is the expected RESULT:
I would like to compute an AMOUNT_NORM column, where each value in the AMOUNT column is divided by the sum of all values in the AMOUNT column.
df
TIME AMOUNT AMOUNT_NORM
20 456 0.38
30 345 0.29
15 122 0.1
12 267 0.22

The following should do what you want:
library(tidyverse)
df %>% mutate(AMOUNT_NORM = AMOUNT / sum(AMOUNT))
EDIT: I didn't read the list-of-dataframes bit. In that case you just do:
lapply(your_df_list, function(x) {
  x %>% mutate(AMOUNT_NORM = AMOUNT / sum(AMOUNT))
})
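If you'd rather not load the tidyverse, a base R equivalent is just as short (a sketch assuming the list is named your_df_list, as above):
# for each data frame in the list, divide AMOUNT by its column sum
lapply(your_df_list, function(x) {
  x$AMOUNT_NORM <- x$AMOUNT / sum(x$AMOUNT)
  x
})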


creating multi rows depend on special conditions [duplicate]

This question already has answers here:
Repeat each row of data.frame the number of times specified in a column
(10 answers)
Closed 3 years ago.
I have data.frame as follows :
duration classlabel
100 W
120 1
390 2
30 3
30 2
150 3
30 4
60 3
60 4
30 3
120 4
30 3
120 4
I have to make a number of rows according to duration, with the corresponding class label, in R. For example, I have to make 100 rows with the class label 'W', then 120 rows with the class label '1', etc.
Can anyone help me solve this problem?
An option would be uncount:
library(tidyr)
uncount(df1, duration, .remove = FALSE)
Or with rep from base R, replicating the sequence of row indices by the 'duration' column and expanding the rows based on that numeric index:
df1[rep(seq_len(nrow(df1)), df1$duration),]
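To see what the base R version does, here is a toy run (this two-row df1 is purely illustrative; the repeated row names come from subsetting by index):
df1 <- data.frame(duration = c(3, 2), classlabel = c("W", "1"))
df1[rep(seq_len(nrow(df1)), df1$duration), ]
#     duration classlabel
# 1          3          W
# 1.1        3          W
# 1.2        3          W
# 2          2          1
# 2.1        2          1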

R Concatenate column in data frame with one value/string [duplicate]

This question already has answers here:
How to add leading zeros?
(8 answers)
Closed 4 years ago.
I am trying to concatenate some data in a column of a df, with "0000"
I tried to use paste() in a loop, but it is very slow, as I have over 2,000,000 rows. It takes forever.
Is there a smarter, less expensive way to do it?
#DF:
CUSTID VALUE
103 12
104 10
105 15
106 12
... ...
#Desired result:
#DF:
CUSTID VALUE
0000103 12
0000104 10
0000105 15
0000106 12
... ...
How can this be achieved?
paste is vectorized, so it works with a vector of values (i.e. a column in a data frame). The following should work:
DF <- data.frame(
CUSTID = 103:107,
VALUE = 13:17
)
DF$CUSTID <- paste0('0000', DF$CUSTID)
This should give you:
CUSTID VALUE
1 0000103 13
2 0000104 14
3 0000105 15
4 0000106 16
5 0000107 17
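Note that prepending a constant "0000" only gives equal-width IDs if every CUSTID already has the same number of digits. If the goal is a fixed total width, formatC from base R pads with zeros directly; a sketch, assuming a target width of 7 as in the example output:
DF <- data.frame(CUSTID = 103:107, VALUE = 13:17)
# pad the numeric IDs with leading zeros to a fixed width of 7
DF$CUSTID <- formatC(DF$CUSTID, width = 7, format = "d", flag = "0")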

Find unique rows [duplicate]

This question already has an answer here:
Extracting only unique rows from data frame in R [duplicate]
(1 answer)
Closed 5 years ago.
This seems so simple, but I can't figure it out.
Given this data frame
df=data.frame(
x = c(12,12,165,165,115,148,148,155,155,521),
y = c(54,54,122,122,215,108,108,655,655,151)
)
df
x y
1 12 54
2 12 54
3 165 122
4 165 122
5 115 215
6 148 108
7 148 108
8 155 655
9 155 655
10 521 151
Now, how can I get the rows that exist only once? That is, rows 5 and 10. The order of rows can be completely arbitrary, so checking the "next" row is not an option. I tried many things, but nothing worked on my data.frame, which has ~40k rows.
I had one solution working on a subset (~1k rows) of my data.frame, which took 3 minutes to process. At that rate, my solution would need about 120 minutes on the original data.frame, which is not appropriate. Can somebody help?
Check duplicated from both the beginning and the end of the data frame; if neither returns TRUE for a row, select it:
df[!(duplicated(df) | duplicated(df, fromLast = TRUE)),]
# x y
#5 115 215
#10 521 151
A solution with table:
library(dplyr)
table(df) %>% as.data.frame %>% subset(Freq == 1) %>% select(-3)
Or with base R, since you said in the comments that you prefer not to load packages:
subset(as.data.frame(table(df)), Freq == 1)[, -3]
Also, I think data.table is very fast for big data sets and filtering, so it may be worth trying too, since you mentioned speed:
library(data.table)
df2 <- copy(df)
df2 <- setDT(df2)[, COUNT := .N, by = 'x,y'][COUNT == 1][, c("x", "y")]
A solution using dplyr. df2 is the final output.
library(dplyr)
df2 <- df %>%
count(x, y) %>%
filter(n == 1) %>%
select(-n)
Another base R solution uses ave to calculate the total number of occurrences for each row and subsets only those that occur once. It can also be modified to subset rows that occur a specific number of times (see the sketch after the output).
df[ave(1:NROW(df), df, FUN = length) == 1,]
# x y
#5 115 215
#10 521 151
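As a sketch of that modification (n here is illustrative; n <- 2 keeps the rows whose combination of values occurs exactly twice):
n <- 2  # keep rows whose exact row values occur n times
df[ave(1:NROW(df), df, FUN = length) == n, ]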

aggregate over multiple columns [duplicate]

This question already has answers here:
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 6 years ago.
Hey, I have some data that looks like this:
ExpNum Compound Peak Tau SS
1 a 100 30 50
2 a 145 23 45
3 b 78 45 56
4 b 45 43 23
5 c 344 23 56
I'd like to find the mean based on the Compound name.
What I have:
Norm_Table$Norm_Peak = aggregate(data[[3]], by = list(Compound), FUN = normalization)
This is fine, and I have this code repeated 3 times, just changing the data[[x]] number. Would lapply work here? Or a for loop?
A dplyr solution:
library(dplyr)
data %>%
group_by(Compound) %>%
summarize_each(funs(mean), -ExpNum)
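Note that summarize_each() and funs() have since been deprecated in dplyr; assuming dplyr >= 1.0, the equivalent with across() is:
library(dplyr)
data %>%
  group_by(Compound) %>%
  summarize(across(-ExpNum, mean))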

Random selection based on a variable in a R dataframe [duplicate]

This question already has answers here:
Take the subsets of a data.frame with the same feature and select a single row from each subset
(3 answers)
Closed 7 years ago.
I have a data frame with 1000 rows. It is a dataset of animals of different breeds, but I have more animals from some breeds than from others. What I want to do is take a random sample of the over-represented breeds so that all breeds end up with the same number of observations.
In detail: I have 400 Holstein animals, 300 Jersey, 100 Hereford, 150 Nelore and 50 Canchim. I want to randomly select 50 animals from each breed, so I would have a total of 250 animals at the end. I know how to randomly select using runif, but I am not sure how to apply that in my case.
My data looks like:
Breed ID Trait1 Trait2 Trait3
Holstein 1 11 22 44
Jersey 2 22 33 55
Nelore 3 33 44 66
Nelore 4 44 55 77
Canchim 5 55 66 88
I have tried:
Data = data[!!ave(seq_along(data$Breed), unique(data$Breed), FUN=function(x) sample(x, 50) == x),]
However, it does not work, and I am not allowed to install the package dplyr on the server that I am using.
Thanks in advance.
You can split your animals data frame on the breed, and then apply a custom function to each chunk which will randomly extract 50 rows:
animals.split <- split(animals, animals$Breed)
animals.list <- lapply(animals.split, function(x) {
  y <- x[sample(nrow(x), 50), ]
  return(y)
})
# rbind the per-breed samples back together; unsplit() would fail here,
# since the group sizes no longer match the original Breed factor
result <- do.call(rbind, animals.list)
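For completeness, the original ave() attempt can also be repaired without any packages; a sketch assuming every breed has at least 50 rows (the grouping argument must be data$Breed itself, not unique(data$Breed)):
# within each breed, x holds that breed's row numbers;
# keep a row if its number is among 50 sampled from the group
keep <- as.logical(ave(seq_along(data$Breed), data$Breed,
                       FUN = function(x) x %in% sample(x, 50)))
Data <- data[keep, ]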
