Data Transformations based on certain transformation criteria - r

I want to transform a dataset based on certain conditions. These conditions are given in another dataset. Let me explain it using an example.
Suppose I've a dataset in the following format:
Date Var1 Var2
3/1/2016 8 14
3/2/2016 7 8
3/3/2016 7 6
3/4/2016 10 8
3/5/2016 5 10
3/6/2016 9 15
3/7/2016 2 5
3/8/2016 6 14
3/9/2016 8 15
3/10/2016 8 8
And the following dataset has the transformation conditions and is in the following format:
Variable Trans1 Trans2
Var1 1||2 0.5||0.7
Var2 1||2 0.3||0.8
Now, I want to take the first condition for Var1 from the transformation table, i.e. 1 and 0.5, add 1 to Var1 and multiply the result by 0.5. I'll do the same for Var2: add 1 and multiply by 0.3. This gives me the new variables Var1_1 and Var2_1. The second condition is applied the same way and gives Var1_2 and Var2_2. For Var1_2, for example, the transformation is Var1 plus 2, multiplied by 0.7.
After the transformation, the dataset will look like the following:
Date Var1 Var2 Var1_1 Var2_1 Var1_2 Var2_2
3/1/2016 8 14 4.5 4.5 7 11.2
3/2/2016 7 8 4 2.7 6.3 7
3/3/2016 7 6 4 2.1 6.3 5.6
3/4/2016 10 8 5.5 2.7 8.4 7
3/5/2016 5 10 3 3.3 4.9 8.4
3/6/2016 9 15 5 4.8 7.7 11.9
3/7/2016 2 5 1.5 1.8 2.8 4.9
3/8/2016 6 14 3.5 4.5 5.6 11.2
3/9/2016 8 15 4.5 4.8 7 11.9
3/10/2016 8 8 4.5 2.7 7 7

Assuming your original data.frame is called df and your conditions table is called cond1, we can create a custom function:
funV1Cond1 <- function(x){
t1 <- as.numeric(gsub("[||].*", "", cond1$Trans1[cond1$Variable == "Var1"]))
t2 <- as.numeric(gsub("[||].*", "", cond1$Trans2[cond1$Variable == "Var1"]))
result <- (x$Var1 + t1)*t2
return(result)
}
funV1Cond1(df)
#[1] 4.5 4.0 4.0 5.5 3.0 5.0 1.5 3.5 4.5 4.5
The second transformation works the same way, this time keeping the part after the delimiter:
funV1Cond2 <- function(x){
t1 <- as.numeric(gsub(".*[||]", "", cond1$Trans1[cond1$Variable == "Var1"]))
t2 <- as.numeric(gsub(".*[||]", "", cond1$Trans2[cond1$Variable == "Var1"]))
result <- (x$Var1 + t1)*t2
return(result)
}
funV1Cond2(df)
#[1] 7.0 6.3 6.3 8.4 4.9 7.7 2.8 5.6 7.0 7.0
If the Trans1 column held three comma-separated conditions, e.g. 1,2,3, you could pick out each one with str_split from the stringr package:
library(stringr)
as.numeric(sapply(str_split(cond1$Trans1[cond1$Variable == "Var1"], ','),function(x) x[2]))
#[1] 2
as.numeric(sapply(str_split(cond1$Trans1[cond1$Variable == "Var1"], ','),function(x) x[1]))
#[1] 1
as.numeric(sapply(str_split(cond1$Trans1[cond1$Variable == "Var1"], ','),function(x) x[3]))
#[1] 3
Note that I changed the delimiter to a ','
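The two hand-written functions above can be generalised so that every variable and every condition in the table is handled in one pass. The following is only a sketch under some assumptions: the helper name apply_transformations is made up for illustration, the conditions table stores its values as character strings separated by "||" as in the question, and only the first three rows of the example data are used.

```r
# Example data (first three rows of the question's dataset)
df <- data.frame(
  Date = c("3/1/2016", "3/2/2016", "3/3/2016"),
  Var1 = c(8, 7, 7),
  Var2 = c(14, 8, 6)
)
cond1 <- data.frame(
  Variable = c("Var1", "Var2"),
  Trans1 = c("1||2", "1||2"),
  Trans2 = c("0.5||0.7", "0.3||0.8"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each variable, add the k-th value of Trans1
# and multiply by the k-th value of Trans2, storing the result as <Variable>_<k>
apply_transformations <- function(data, cond) {
  for (i in seq_len(nrow(cond))) {
    v <- cond$Variable[i]
    # fixed = TRUE splits on the literal "||" instead of treating it as a regex
    add  <- as.numeric(strsplit(cond$Trans1[i], "||", fixed = TRUE)[[1]])
    mult <- as.numeric(strsplit(cond$Trans2[i], "||", fixed = TRUE)[[1]])
    for (k in seq_along(add)) {
      data[[paste(v, k, sep = "_")]] <- (data[[v]] + add[k]) * mult[k]
    }
  }
  data
}

out <- apply_transformations(df, cond1)
out
```

The new columns come out grouped by variable (Var1_1, Var1_2, Var2_1, Var2_2) rather than by condition. Var2_2 is computed with the multiplier 0.8 exactly as it appears in the conditions table, so its values can differ from the hand-computed table in the question.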

Related

Setting values to NA in one column based on conditions in another column

Here's a simplified mock dataframe:
df1 <- data.frame(amb = c(2.5,3.6,2.1,2.8,3.4,3.2,1.3,2.5,3.2),
warm = c(3.6,5.3,2.1,6.3,2.5,2.1,2.4,6.2,1.5),
sensor = c(1,1,1,2,2,2,3,3,3))
I'd like to set all values in the "amb" column to NA if they're in sensor 1, but retain the values in the "warm" column for sensor 1. Here's what I'd like the final output to look like:
amb warm sensor
NA 3.6 1
NA 5.3 1
NA 2.1 1
2.8 6.3 2
3.4 2.5 2
3.2 2.1 2
1.3 2.4 3
2.5 6.2 3
3.2 1.5 3
Using R version 4.0.2, Mac OS X 10.13.6
A possible solution, based on dplyr:
library(dplyr)
df1 %>%
mutate(amb = ifelse(sensor == 1, NA, amb))
#> amb warm sensor
#> 1 NA 3.6 1
#> 2 NA 5.3 1
#> 3 NA 2.1 1
#> 4 2.8 6.3 2
#> 5 3.4 2.5 2
#> 6 3.2 2.1 2
#> 7 1.3 2.4 3
#> 8 2.5 6.2 3
#> 9 3.2 1.5 3
This seems best handled with the vectorized replacement function is.na<-
is.na(df1$amb) <- df1$sensor %in% c(1) # the c() isn't needed for a single value
To be more general and compare floating-point numbers safely, test against a small tolerance instead of exact equality. Note the abs(): without it, every sensor value below 1 would be matched as well.
is.na(df1$amb) <- abs(df1$sensor - 1) < 1e-8
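To illustrate why a tolerance can matter, here is a small sketch with made-up sensor values that are the result of floating-point arithmetic. The helper near_equal and the data frame x are hypothetical names used only for this example.

```r
# Hypothetical helper: TRUE where x is within `tol` of `target`
near_equal <- function(x, target, tol = sqrt(.Machine$double.eps)) {
  abs(x - target) < tol
}

# Made-up data: the second sensor value is computed and is not exactly 0.3
x <- data.frame(amb = c(2.5, 3.6, 2.1), sensor = c(0.3, 0.1 + 0.2, 2))

exact_match <- x$sensor == 0.3            # misses the computed value
tol_match   <- near_equal(x$sensor, 0.3)  # catches both

# Set amb to NA wherever the sensor value is (approximately) 0.3
is.na(x$amb) <- tol_match
x
```

For integer-like sensor codes such as 1, 2, 3 that were typed in directly, plain == or %in% is normally fine; the tolerance only becomes relevant when the values were produced by arithmetic.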

Select varying number of top_n for different groups using dplyr

I have the following dataframe. I would prefer to use dplyr to solve this problem.
For each zone I want at minimum two values. Value > 4.0 is preferred.
Therefore, for zone 10 all values (being > 4.0) are kept. For zone 20, top two values are picked. Similarly for zone 30.
zone <- c(rep(10,4), rep(20, 4), rep(30, 4))
set.seed(1)
value <- c(4.5,4.3,4.6, 5,5, rep(3,7)) + round(rnorm(12, sd = 0.1),1)
df <- data.frame(zone, value)
> df
zone value
1 10 4.4
2 10 4.3
3 10 4.5
4 10 5.2
5 20 5.0
6 20 2.9
7 20 3.0
8 20 3.1
9 30 3.1
10 30 3.0
11 30 3.2
12 30 3.0
The desired output is as follows
> df
zone value
1 10 4.4
2 10 4.3
3 10 4.5
4 10 5.2
5 20 5.0
6 20 3.1
7 30 3.1
8 30 3.2
I thought of using top_n but it picks the same number for each zone.
You could dynamically calculate n in top_n
library(dplyr)
df %>% group_by(zone) %>% top_n(max(sum(value > 4), 2), value)
# zone value
# <dbl> <dbl>
#1 10 4.4
#2 10 4.3
#3 10 4.5
#4 10 5.2
#5 20 5
#6 20 3.1
#7 30 3.1
#8 30 3.2
Another option is to filter on the rank within each group, keeping the top two values plus anything above 4:
library(tidyverse)
df %>%
  group_by(zone) %>%
  filter(row_number(desc(value)) <= 2 | value > 4)
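For comparison, the same rule ("keep everything above 4, but at least the top two per zone") can be sketched in base R without dplyr. The helper name keep_rows and its arguments are made up for this example, and the values are typed in from the printed data frame in the question rather than regenerated with rnorm.

```r
# Values copied from the printed data frame in the question
df <- data.frame(
  zone  = rep(c(10, 20, 30), each = 4),
  value = c(4.4, 4.3, 4.5, 5.2, 5.0, 2.9, 3.0, 3.1, 3.1, 3.0, 3.2, 3.0)
)

# Hypothetical helper: given the row indices of one group, return the
# indices of the n largest values, where n = max(count above threshold, min_n)
keep_rows <- function(idx, value, threshold = 4, min_n = 2) {
  v <- value[idx]
  n <- min(max(sum(v > threshold), min_n), length(idx))
  idx[order(v, decreasing = TRUE)][seq_len(n)]
}

# Split the row indices by zone, pick rows per zone, then restore row order
keep <- unlist(
  lapply(split(seq_len(nrow(df)), df$zone), keep_rows, value = df$value),
  use.names = FALSE
)
result <- df[sort(keep), ]
result
```

Unlike top_n, this sketch keeps exactly n rows per group, so ties at the cut-off are broken by row position instead of all being returned.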

Creating new column names from existing column names using paste function

Assume I have a data frame df with variables A, B and C in it. I would like to create 3 more corresponding columns with names A_ranked, B_ranked and C_ranked. It doesn't matter how I will fill them for the sake of this question, so let's assume that I will set them all to 5. I tried the following code:
for (i in 1:length(df)){
df%>%mutate(
paste(colnames(df)[i],"ranked", sep="_")) = 5
}
I also tried:
for (i in 1:length(df)){
df%>%mutate(
as.vector(paste(colnames(df)[i],"ranked", sep="_")) = 5
}
And:
for (i in 1:length(df)){
df$paste(colnames(df)[i],"ranked", sep="_")) = 5
}
None of them seems to work. Can somebody please tell me the correct way to do this?
Here is a data.table option using the iris data set (here we create 4 more columns based on the column names of the existing columns). Note the sep = "_" argument: without it, paste() treats "_" as a third string to join and produces names like "Sepal.Length ranked _".
# data
df <- iris[, 1:4]
str(df)
# new columns
library(data.table)
cols <- colnames(df) # original column names, saved before any columns are added
setDT(df)[, paste(cols, "ranked", sep = "_") := 5][]
# output
Sepal.Length Sepal.Width Petal.Length Petal.Width Sepal.Length_ranked
1: 5.1 3.5 1.4 0.2 5
2: 4.9 3.0 1.4 0.2 5
3: 4.7 3.2 1.3 0.2 5
4: 4.6 3.1 1.5 0.2 5
5: 5.0 3.6 1.4 0.2 5
---
146: 6.7 3.0 5.2 2.3 5
147: 6.3 2.5 5.0 1.9 5
148: 6.5 3.0 5.2 2.0 5
149: 6.2 3.4 5.4 2.3 5
150: 5.9 3.0 5.1 1.8 5
Sepal.Width_ranked Petal.Length_ranked Petal.Width_ranked
1: 5 5 5
2: 5 5 5
3: 5 5 5
4: 5 5 5
5: 5 5 5
---
146: 5 5 5
147: 5 5 5
148: 5 5 5
149: 5 5 5
150: 5 5 5
# If you want to fill the new columns with different values, start again
# from the original four columns and try something like
df <- iris[, 1:4]
cols <- colnames(df)
setDT(df)[, paste(cols, "ranked", sep = "_") := list(Sepal.Length/2,
                                                     Sepal.Width/2,
                                                     Petal.Length/2,
                                                     Petal.Width/2)][]
This should work:
df[paste(names(df), "ranked", sep = "_")] <- 5
df
# A B C A_ranked B_ranked C_ranked
# 1 1 2 3 5 5 5
Data:
df <- data.frame(A = 1, B = 2, C = 3)
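The same base R assignment also works when each new column needs its own values instead of a constant. As a sketch, assuming the goal is an actual rank per column (the question leaves the fill values open) and using made-up data:

```r
# Made-up example data with three rows so that ranks are meaningful
df <- data.frame(A = c(3, 1, 2), B = c(10, 30, 20), C = c(5, 5, 1))

# lapply() returns one ranked vector per existing column; assigning the list
# to a character vector of new names creates all *_ranked columns at once
df[paste(names(df), "ranked", sep = "_")] <- lapply(df, rank)
df
```

The right-hand side is evaluated before the assignment, so only the original columns A, B and C are ranked. Ties get averaged ranks by default, which is why the two 5s in C both receive 2.5.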
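Does this help? It uses across() from dplyr, since the older mutate_each() and funs() are deprecated:
library(dplyr)
dat <- data.frame(A=5,B=5,C=5)
dat %>%
  mutate(across(everything(), sum, .names = "{.col}_ranked")) %>%
  head()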

How can I loop through a consecutive window?

I have a df like this:
> df
symbol x1 x2
1 A 3.6 5.2
2 A 10.0 4.8
3 A 5.2 0.2
4 A -10.2 0.4
5 A 5.4 -2.5
6 B 9.9 6.5
7 B 15.8 -1.8
8 B 4.5 -5.9
9 C -2.0 0.5
10 C -10.0 2.6
11 C 7.7 8.9
12 C 10.5 18.5
I want to calculate the r squared between x1 and x2 column by symbol so I want to get a new df like this
symbol r squared
1 A 0.27
2 B 0.30
3 C 0.68
I use ifelse but it isn't working.
for (i in 1:12){
results[i] <- ifelse(df$symbol == symbollist[i], summary(lm(df$x1~df$x2))$r.squared,0)
}
How can I solve this problem in R?
You can use by to perform lm for each symbol:
by(df, df$symbol, function(x) summary(lm(x1~x2, x))$r.squared)
df$symbol: A
[1] 0.07445258
-----------------------------------------------------------------------------------------------------------
df$symbol: B
[1] 0.09014209
-----------------------------------------------------------------------------------------------------------
df$symbol: C
[1] 0.687236
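You can use the dplyr package for this. For a regression with a single predictor, R squared equals the squared correlation, so try:
library(dplyr)
result <- df %>%
  group_by(symbol) %>%
  summarize(r_squared = cor(x1, x2)^2)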
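To get the result as a data frame in the shape the question asks for, the by() idea can be sketched with split() and sapply() in base R. The data below is typed in from the question; the column name r_squared is just a choice made for this example.

```r
# Data from the question
df <- data.frame(
  symbol = rep(c("A", "B", "C"), times = c(5, 3, 4)),
  x1 = c(3.6, 10.0, 5.2, -10.2, 5.4, 9.9, 15.8, 4.5, -2.0, -10.0, 7.7, 10.5),
  x2 = c(5.2, 4.8, 0.2, 0.4, -2.5, 6.5, -1.8, -5.9, 0.5, 2.6, 8.9, 18.5)
)

# Fit one regression per symbol and extract its R squared
r2 <- sapply(
  split(df, df$symbol),
  function(d) summary(lm(x1 ~ x2, data = d))$r.squared
)

# Collect the named vector into a two-column data frame
result <- data.frame(symbol = names(r2), r_squared = unname(r2))
result
```

The values agree with the by() output above (about 0.074, 0.090 and 0.687) and with the squared correlation within each group.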

How to get the proportions of data with respect to two variables in R?

I have 4 columns: Vehicle ID, Vehicle Class, Vehicle Length and Vehicle Width. Every vehicle has a unique vehicle ID (e.g. 2, 4, 5,...) and the data was collected every 0.1 seconds which means that vehicle IDs are repeated in Vehicle ID column for the number of times they were observed. There are three vehicle classes i.e. 1=motorcycles, 2=cars, 3=trucks in the Vehicle Class column and the lengths and widths are in their respective columns against every vehicle ID. I want to subset the data by vehicle class and then find the proportions of each vehicle model (unique length and width) within every class. For example, for the Vehicle Class = 2 i.e. car, I want to find different models of cars (unique length and width) and their proportions with respect to total number of cars. Here is what I have done so far:
To subset data by Vehicle Class
cars <- subset(b, b$'Vehicle class'==2)
trucks <- subset(b, b$'Vehicle class'==3)
motorcycles <- subset(b, b$'Vehicle class'==1)
To find the number of cars
numofcars <- length(unique(cars$'Vehicle ID')) # 2830
numoftrucks <- length(unique(trucks$'Vehicle ID')) # 137
numofmotorcycles <- length(unique(motorcycles$'Vehicle ID'))# 45
The above code worked but I could not find the proportions by using the code below:
by (cars, INDICES=cars$'Vehicle Length', FUN=table(cars$'Vehicle width'))
R gives an error stating that it could not find 'FUN'. Please help me in finding the proportions of each model within all classes of vehicles.
EDIT (Sample Input)
Vehicle ID Vehicle Class Vehicle Length Vehicle Width
2 2 13.5 4.5
2 2 13.5 4.5
2 2 13.5 4.5
2 2 13.5 4.5
3 2 13.5 4.0
3 2 13.5 4.0
3 2 13.5 4.0
3 2 13.5 4.0
4 2 10.0 4.5
4 2 10.0 4.5
4 2 10.0 4.5
4 2 10.0 4.5
5 3 23.0 4.5
5 3 23.0 4.5
5 3 23.0 4.5
5 3 23.0 4.5
6 3 76.5 4.5
6 3 76.5 4.5
6 3 76.5 4.5
6 3 76.5 4.5
6 3 76.5 4.5
7 1 10.0 3.0
7 1 10.0 3.0
7 1 10.0 3.0
7 1 10.0 3.0
8 2 13.5 5.5
8 2 13.5 5.5
8 2 13.5 5.5
8 2 13.5 5.5
Note that in this input: Total number of cars=4, trucks=2, motorcycles=1
Sample Output
Group: cars
VehicleLength VehicleWidth Proportion
13.5 4.5 0.25
13.5 4.0 0.25
13.5 5.5 0.25
10.0 4.5 0.25
Group: trucks
VehicleLength VehicleWidth Proportion
23.0 4.5 0.5
76.5 4.5 0.5
Group: motorcycles
VehicleLength VehicleWidth Proportion
10.0 3.0 1.0
You should have given sample output and sample input to make it easier to answer. From what I understand, you want something along the lines of this -
library(data.table)
dt <- data.table(df)
dt2 <- dt[,
list(ClassLengthWidthFreq = .N),
by = c('VehicleClass','VehicleLength','VehicleWidth')
]
dt2[,
ClassLengthWidthFreqProportion := ClassLengthWidthFreq / sum(ClassLengthWidthFreq),
by = 'VehicleClass'
]
Output -
> dt2
VehicleClass VehicleLength VehicleWidth ClassLengthWidthFreq ClassLengthWidthFreqProportion
1: 2 13.5 4.5 4 0.2500000
2: 2 13.5 4.0 4 0.2500000
3: 2 10.0 4.5 4 0.2500000
4: 3 23.0 4.5 4 0.4444444
5: 3 76.5 4.5 5 0.5555556
6: 1 10.0 3.0 4 1.0000000
7: 2 13.5 5.5 4 0.2500000
If not, then please add sample output and sample input.
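The data.table code above counts rows (observations), which is why trucks come out as 0.44 and 0.56 rather than the 0.5 and 0.5 in the sample output. If the proportions should be based on the number of distinct vehicles instead, one possible base R sketch is to reduce the data to one row per vehicle first. It assumes the columns are named without spaces (VehicleID, VehicleClass, VehicleLength, VehicleWidth) and rebuilds the sample input as b; the names vehicles, models and Vehicles are introduced only for this example.

```r
# Rebuild the sample input: each vehicle is observed 4 or 5 times
reps <- c(4, 4, 4, 4, 5, 4, 4)
b <- data.frame(
  VehicleID     = rep(c(2, 3, 4, 5, 6, 7, 8), reps),
  VehicleClass  = rep(c(2, 2, 2, 3, 3, 1, 2), reps),
  VehicleLength = rep(c(13.5, 13.5, 10, 23, 76.5, 10, 13.5), reps),
  VehicleWidth  = rep(c(4.5, 4, 4.5, 4.5, 4.5, 3, 5.5), reps)
)

# One row per vehicle, so repeated observations no longer affect the counts
vehicles <- unique(b)

# Number of distinct vehicles for each class / length / width combination
models <- aggregate(
  VehicleID ~ VehicleClass + VehicleLength + VehicleWidth,
  data = vehicles, FUN = length
)
names(models)[names(models) == "VehicleID"] <- "Vehicles"

# Share of each model within its vehicle class
models$Proportion <- models$Vehicles /
  ave(models$Vehicles, models$VehicleClass, FUN = sum)
models <- models[order(models$VehicleClass), ]
models
```

With the sample input this gives 0.25 for each of the four car models, 0.5 for each truck model and 1 for the single motorcycle model.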
