I'm not sure if the title is worded well, but here is the situation:
I have a metadata dataset, which can have any number of rows, e.g.:
Control_DF <- cbind.data.frame(
Scenario = c("A","B","C")
,Variable = c("V1","V2","V3")
,Weight = c("w1","w2","w3")
)
Using the data contained in Control_DF, I want to create a new version of each Variable in my main dataset, where I multiply the variable by its weight. So if my main dataset looks like this:
Main_Data <- cbind.data.frame(
V1 = c(1,2,3,4)
,V2 = c(2,3,4,5)
,V3 = c(3,4,5,6)
,w1 = c(0.1,0.5,1,0.8)
,w2 = c(0.2,1,0.3,0.6)
,w3 = c(0.3,0.7,0.1,0.2)
)
Then, written out by hand, what I want to do looks like this:
New_Data <- Main_Data %>%
mutate(
weighted_V1 = V1 * w1
,weighted_V2 = V2 * w2
,weighted_V3 = V3 * w3
)
However, I need a way of not hard coding this, and such that the number of variables being referenced is arbitrary.
Can anyone help me?
In base R with lapply, Map and cbind you could do as follows:
# with Control_DF create a list with pairs of <varName,wgt>
controlVarList = lapply(Control_DF$Scenario, function(x)
  as.vector(as.matrix(Control_DF[Control_DF$Scenario == x, c("Variable","Weight")]))
)
controlVarList
#[[1]]
#[1] "V1" "w1"
#
#[[2]]
#[1] "V2" "w2"
#
#[[3]]
#[1] "V3" "w3"
# A custom function for multiplication of both columns
fn_weightedVars = function(x) {
# x = c("V1","w1"); hence x[1] = "V1",x[2] = "w2"
# reference these columns in Main_Data and do scaling
wgtedCol = matrix(Main_Data[,x[1]] * Main_Data[,x[2]],ncol=1)
#rename as required
colnames(wgtedCol)= paste0("weighted_",x[1])
#return var
wgtedCol
}
#call function on each list element
scaledList = Map(fn_weightedVars, controlVarList)
Output:
scaledDF = do.call(cbind,scaledList)
#combine datasets
New_Data = data.frame(Main_Data,scaledDF)
New_Data
# V1 V2 V3 w1 w2 w3 weighted_V1 weighted_V2 weighted_V3
#1 1 2 3 0.1 0.2 0.3 0.1 0.4 0.9
#2 2 3 4 0.5 1.0 0.7 1.0 3.0 2.8
#3 3 4 5 1.0 0.3 0.1 3.0 1.2 0.5
#4 4 5 6 0.8 0.6 0.2 3.2 3.0 1.2
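For what it's worth, the intermediate list of pairs can be skipped by mapping over the two control columns directly. A minimal sketch of the same idea (the as.character() calls guard against the control columns being factors):
# pair each Variable with its Weight and multiply, column by column
scaled <- Map(function(v, w) Main_Data[[v]] * Main_Data[[w]],
              as.character(Control_DF$Variable),
              as.character(Control_DF$Weight))
names(scaled) <- paste0("weighted_", names(scaled))
New_Data <- data.frame(Main_Data, scaled)
This gives the same New_Data as above, since Map() names its result after its first argument (V1, V2, V3).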
I need to import data files with varying numbers of columns. The code will eventually be used by co-workers who are not very familiar with R, so it should be robust and preferably free of warning messages. The main problem is that the header always ends with an additional "," which does not appear in the data below. Besides a whole bunch of unused columns, the required columns are always labelled the same way, i.e. there is always a certain string within the column name, but not necessarily the whole column name is identical.
The example code is a very simple approximation to my files. First, I would like to get rid of the error message caused by the erroneous comma at the end of the header, something like skip_col = ncol(header). Second, I would like to read only the columns with "*des*" within the column name.
My approach looks simple in this simplified example but is not very satisfying in my more complex code.
library(tidyverse)
read_csv("date,col1des,col1foo,col2des,col3des,col2foo,col3foo,
2015-10-23T22:00:00Z,0.6,-1.5,-1.3,-0.5,1.8,0
2015-10-23T22:10:00Z,-0.5,-0.6,1.5,0.1,-0.3,0.3
2015-10-23T22:20:00Z,0.1,0.2,-1.6,-0.1,-1.4,-0.4
2015-10-23T22:30:00Z,1.7,-1.2,-0.2,-0.4,0.3,0.3")
if (length(grep("des", names(data))) > 0) {
des <- data[grep("des", names(data))]
des <- bind_cols(date = data$date, des)
}
So in my full code, I get the following warning messages:
1. Missing column names filled in: 'X184' [184]
2. Duplicated column names deduplicated: [long list of unrequired columns with duplicated names]
I would appreciate a solution within the tidyverse. As far as I can tell, it is not possible to use regular expressions directly within the read_csv call to specify column names, right? So perhaps the only way is to read the header first and build the cols() call out of that. But this exceeds my R knowledge.
Edit:
I wonder if something like this is possible:
headline <- "date,col1des,col1foo,col2des,col3des,col2foo,col3foo,"
head <- headline %>% strsplit(",") %>% unlist(use.names = FALSE)
head_des <- head[grep("des", head)]
data <- read_csv("mydata.csv", col_types = cols_only(head_des[1] = "d", head_des[2] = "d"))
I would like to grep() the column names before reading the whole data.
Edit no. 2
In reaction to your comment, this works with your data string:
library(tidyverse)
yourData <- "date,col1des,col1foo,col2des,col3des,col2foo,col3foo,
2015-10-23T22:00:00Z,0.6,-1.5,-1.3,-0.5,1.8,0
2015-10-23T22:10:00Z,-0.5,-0.6,1.5,0.1,-0.3,0.3
2015-10-23T22:20:00Z,0.1,0.2,-1.6,-0.1,-1.4,-0.4
2015-10-23T22:30:00Z,1.7,-1.2,-0.2,-0.4,0.3,0.3"
data <- suppressWarnings(read_csv(yourData))
header <- names(data)
colList <- ifelse(str_detect(header,'des'),'c','_') %>% as.list
suppressWarnings(read_csv(yourData,col_types = do.call(cols_only, colList)))
#> # A tibble: 4 x 3
#> col1des col2des col3des
#> <chr> <chr> <chr>
#> 1 0.6 -1.3 -0.5
#> 2 -0.5 1.5 0.1
#> 3 0.1 -1.6 -0.1
#> 4 1.7 -0.2 -0.4
EDIT
Trying to accommodate your edited wishes, and with the help of this post:
library(tidyverse)
header <- suppressWarnings(readLines('file.csv')[1]) %>%
str_split(',',simplify = T)
colList <- ifelse(str_detect(header,'des'),'c','_') %>% as.list
suppressWarnings(read_csv(file = 'file.csv',col_types = do.call(cols_only, colList)))
#> # A tibble: 4 x 3
#> col1des col2des col3des
#> <chr> <chr> <chr>
#> 1 0.6 -1.3 -0.5
#> 2 -0.5 1.5 0.1
#> 3 0.1 -1.6 -0.1
#> 4 1.7 -0.2 -0.4
This is the most robust, most tidyverse way I could come up with:
library(tidyverse)
file <- suppressWarnings(readLines('file.csv')) %>%
str_split(',')
dims <- file %>% map_int(~length(.))
if(any(dims != median(dims))){
file[[which(dims != median(dims))]] <- file[[which(dims != median(dims))]][1:median(dims)]
}
data <- file %>% map_chr(~paste(., collapse = ',')) %>%
  paste(collapse = '\n') %>% read_csv
(data <- data %>% select(which(str_detect(names(data), pattern = 'des'))))
#> # A tibble: 4 x 3
#> col1des col2des col3des
#> <dbl> <dbl> <dbl>
#> 1 0.6 -1.3 -0.5
#> 2 -0.5 1.5 0.1
#> 3 0.1 -1.6 -0.1
#> 4 1.7 -0.2 -0.4
Where file.csv contains your data.
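A possible shortcut, if you would rather not handle the raw lines yourself: readr can read just the header row by setting n_max = 0, and the dummy column created by the trailing comma is then dropped by cols_only() like any other unwanted column. A sketch, assuming the same file.csv:
library(tidyverse)
# read zero data rows; we only want the (auto-repaired) column names
hdr <- names(suppressWarnings(read_csv('file.csv', n_max = 0)))
colList <- ifelse(str_detect(hdr, 'des'), 'c', '_') %>% as.list
suppressWarnings(read_csv('file.csv', col_types = do.call(cols_only, colList)))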
I have a dataset which includes multiple latitude and longitude points within the same column and also has columns with additional variables, like so:
[screenshot: what the data currently looks like]
What I would like to do is extract the numbers in pairs (i.e. 144.81803494458699788 and -37.80978699721590175, then 144.8183146450259926 and -37.80819285880839686) into their own rows. Each new row should also duplicate the rest of the original row from which it came, i.e.:
[screenshot: what I would like the data to look like]
I'm pretty new to R, hence what might seem like a basic question to you all. Update: I've now used
new$latlongs <- str_extract_all(roadchar$X.wkt_geom, "-?[0-9]+\\.[0-9]+")
and have the numbers/latlongs extracted, including the negative sign :)
You can use a loop that combines gsub and strsplit:
## The data.frame
df <- data.frame ("Polyline" = c("MultiLineString((1.1 - 1.1, 2.2 - 2.2))",
"MultiLineString((3.3 - 3.3, 4.4 - 4.4, 5.5 - 5.5))"),
t(matrix(c(LETTERS[c(1:3,24:26)]), 3,
dimnames = list(c("Char1", "Char2", "Char3")))),
stringsAsFactors = FALSE)
# Polyline Char1 Char2 Char3
# 1 MultiLineString((1.1 - 1.1, 2.2 - 2.2)) A B C
# 2 MultiLineString((3.3 - 3.3, 4.4 - 4.4, 5.5 - 5.5)) X Y Z
## Function for splitting the line
split.polyline <- function(line, df) {
## Removing the text and brackets
cleaned_line <- gsub("\\)\\)", "", gsub("MultiLineString\\(\\(", "", as.character(df$Polyline[line])))
## Splitting the line
split_line <- strsplit(cleaned_line, split = ", ")[[1]]
## Making the line into a data.frame
df_out <- data.frame("Polyline" = split_line,
matrix(rep(df[line, -1], length(split_line)),
nrow = length(split_line), byrow = TRUE,
dimnames = list(c(), names(df)[-1]))
)
return(df_out)
}
## You can use the function like this for the first row for example
df_out <- split.polyline(1, df)
# Polyline Char1 Char2 Char3
# 1 1.1 - 1.1 A B C
# 2 2.2 - 2.2 A B C
## Or loop through all the rows
for(line in 2:nrow(df)){
df_out <- rbind(df_out, split.polyline(line, df))
}
# Polyline Char1 Char2 Char3
# 1 1.1 - 1.1 A B C
# 2 2.2 - 2.2 A B C
# 3 3.3 - 3.3 X Y Z
# 4 4.4 - 4.4 X Y Z
# 5 5.5 - 5.5 X Y Z
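If a tidyverse solution is acceptable, tidyr::separate_rows() does the same row expansion in one call once the wrapper text is stripped; a short sketch on the same df:
library(dplyr)
library(tidyr)
# strip the MultiLineString(( ... )) wrapper, then split each pair onto its own row
df %>%
  mutate(Polyline = gsub("MultiLineString\\(\\(|\\)\\)", "", Polyline)) %>%
  separate_rows(Polyline, sep = ", ")
The remaining columns (Char1, Char2, Char3) are duplicated automatically for each new row.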
I have the following task:
a = data.frame(a= c(1,2,3,4,5,6)) # dataset
range01 <- function(x){(x-min(a$a))/(max(a$a)-min(a$a))} # rule for scale
b = data.frame(a = 6) # newdaset
lapply(b$a, range01) # we can apply range01 for this dataset because we use min(a$a) in the rule
But how can I apply this when I have many columns in my dataset? Like below:
a = data.frame(a= c(1,2,3,4,5,6))
b = data.frame(b= c(1,2,3,3,2,1))
c = data.frame(c= c(6,2,4,4,5,6))
df = cbind(a,b,c)
df
new = data.frame(a = 1, b = 2, c = 3)
Of course I can write a rule for every variable:
range01a <- function(x){(x-min(df$a))/(max(df$a)-min(df$a))}
But that is a very long way round. How can this be made convenient?
You can redefine your scale function so that it takes two arguments: the values to be scaled and the reference data that defines the scaling. Then use Map on the two data frames:
scale_custom <- function(x, scaler) (x - min(scaler)) / (max(scaler) - min(scaler))
Map(scale_custom, new, df)
#$a
#[1] 0
#$b
#[1] 0.5
#$c
#[1] 0.25
If you need the data frame as result:
as.data.frame(Map(scale_custom, new, df))
# a b c
#1 0 0.5 0.25
You can exploit the fact that the column names of new and df are the same. This can be helpful if the order of the columns in the two data frames differs.
sapply(names(new), function(x) (new[x]-min(df[x]))/(max(df[x])-min(df[x])))
#$a.a
#[1] 0
#$b.b
#[1] 0.5
#$c.c
#[1] 0.25
To put it in a data.frame:
data.frame(lapply(names(new), function(x) (new[x]-min(df[x]))/(max(df[x])-min(df[x]))))
# a b c
#1 0 0.5 0.25
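The two ideas combine naturally: reindex df by the names of new before mapping, so the pairing no longer depends on column order. A small sketch:
# align df's columns with new's names, then scale pairwise
as.data.frame(Map(scale_custom, new, df[names(new)]))
# a   b    c
#1 0 0.5 0.25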
This is my data (imagine I have 1050 rows of data shown below)
ID_one ID_two parameterX
111 aaa 23
222 bbb 54
444 ccc 39
My code then divides the rows into groups of 100 (there will be about 10 groups of 100 rows).
I then want to get the summary statistics per group (not working).
After that I want to place the summary statistics in a data frame to plot them.
For example, put all 10 means for parameterX together in one data frame, put all 10 standard deviations for parameterX in the same data frame, etc.
The following code is not working:
#assume data is available
dataframe_size <- nrow(thedata)
group_size <- 100
number_ofgroups <- round(dataframe_size / group_size)
#splitdata into groups of 100
split_dataframe_into_groups <- function(x,y)
0:(x-1) %% y
list1 <- split(thedata, split_dataframe_into_groups(nrow(thedata), group_size))
#print data in the first group
list1[[1]]$parameterX
#NOT WORKING!!! #get summary stat for all 10 groups
# how to loop through all 10 groups?
list1_stat <- do.call(data.frame, list(mean = apply(list1[[1]]$parameterX, 2, mean),
                                       sd = apply(list1[[1]]$parameterX, 2, sd), . . .))
The error message is always:
Error in apply(...) : dim(X) must have a positive length
That makes no sense to me, because when I run this code the data clearly exists:
#print data in the first group
list1[[1]]$parameterX
#how to put all means in a dataframe?
# how to put all standard deviations in the same dataframe
e.g. df1 <- data.frame(mean = c(2,2,3,4,7,2,4,9,8,9),
                       sd   = c(0.1, 3, 0.5, . . .))
dplyr is very good for this kind of thing. If you create a new column that assigns a group ID based on row position, then you can summarize each group very easily. I use an index to assist in assigning group IDs.
install.packages('dplyr')
library(dplyr)
## Create index
df$index <- 1:nrow(df)
## Assign group labels
df$group <- paste("Group", substr(df$index, 1, 1), sep = " ")
df[df$index <= 100, 'group'] <- "Group 0"
df[df$index > 1000, 'group'] <- paste("Group", substr(df$index, 1, 2), sep = " ")
df[df$index > 10000, 'group'] <- paste("Group", substr(df$index, 1, 3), sep = " ")
## Get summaries
df <- group_by(df, group)
summaries <- summarise(df, avg = mean(parameterX),
                       minimum = min(parameterX),
                       maximum = max(parameterX),
                       med = median(parameterX))
## note: base R's mode() returns an object's storage mode, not the
## statistical mode, so it is not used here
... and so on.
Hope this helps.
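As an aside, the group label can also be built with integer division, which avoids the substr() special cases for the different index widths; a sketch:
## every block of 100 consecutive rows gets the same label
df$group <- paste("Group", (df$index - 1) %/% 100)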
I think this might be a good place to use tapply. There is an excellent summary here! One path forward might be an extension of the following:
df <- data.frame(id= c(rep("AA",10),rep("BB",10)), x=runif(20))
do.call("rbind", tapply(df$x, df$id, summary))
I think this is what you want :
require(dplyr)
dt<-rbind(iris,iris,iris)
dataframe_size <- nrow(dt)
group_size <- 100
number_ofgroups <- round(dataframe_size / group_size)
df<-dt %>%
# Creating the "bins" column using mutate
mutate(bins=cut(seq_len(dataframe_size),breaks=number_ofgroups)) %>%
# Aggregating the summary statistics by the bins variable
group_by(bins) %>%
# Calculating the mean
summarise(mean.Sepal.Length = mean( Sepal.Length))
head(dt)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
df
bins mean.Sepal.Length
(fctr) (dbl)
1 (0.551,113] 5.597345
2 (113,226] 5.755357
3 (226,338] 5.919643
4 (338,450] 6.100885
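Since the question also asks for standard deviations, summarise() can return several statistics per bin at once; a sketch extending the same pipeline:
dt %>%
  mutate(bins = cut(seq_len(dataframe_size), breaks = number_ofgroups)) %>%
  group_by(bins) %>%
  summarise(mean.Sepal.Length = mean(Sepal.Length),
            sd.Sepal.Length   = sd(Sepal.Length),
            n                 = n())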
I have a vector of zeros, say of length 10. So
v = rep(0,10)
I want to populate some values of the vector, based on a set of indexes in v1 and another vector v2 that holds the corresponding values in sequence. So v1 has the indexes, say
v1 = c(1,2,3,7,8,9)
and
v2 = c(0.1,0.3,0.4,0.5,0.1,0.9)
In the end I want
v = c(0.1,0.3,0.4,0,0,0,0.5,0.1,0.9,0)
So the indexes in v1 got mapped from v2 and the remaining entries stayed 0. I can obviously write a for loop, but that's taking too long in R, owing to the length of the actual data. Is there a simple way to do this?
You can assign it this way:
v[v1] = v2
For example:
> v = rep(0,10)
> v1 = c(1,2,3,7,8,9)
> v2 = c(0.1,0.3,0.4,0.5,0.1,0.9)
> v[v1] = v2
> v
[1] 0.1 0.3 0.4 0.0 0.0 0.0 0.5 0.1 0.9 0.0
You can also do it with replace:
v = rep(0,10)
v1 = c(1,2,3,7,8,9)
v2 = c(0.1,0.3,0.4,0.5,0.1,0.9)
replace(v, v1, v2)
[1] 0.1 0.3 0.4 0.0 0.0 0.0 0.5 0.1 0.9 0.0
See ?replace for details.
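Note that replace() does not modify v in place; it returns a new vector, so assign the result if you want to keep it:
v <- replace(v, v1, v2)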