A data frame:
library(lubridate) # provides ymd()
df <- data.frame(
date = seq(ymd('2021-01-01'), ymd('2021-01-31'), by = 1),
ims_x = rnorm(31, mean = 0),
ims_y = rnorm(31, mean = 1),
ims_z = rnorm(31, mean = 2),
blah = 1:31
)
I'd like to mutate 3 new fields (not overwrite), 'ims_x_lagged', 'ims_y_lagged' and 'ims_z_lagged', where each new field corresponds to the original but lagged by one day/row. The names of the new fields would just have '_lagged' appended to the name of the original, and each value would be that of its original in the preceding row.
I could do this manually for each field, but that would be a lot of typing and my real data has many more than 3 fields that need to be lagged.
Something kind of like this, if it's possible to tell what I'm trying to do:
df <- df %>%
mutate_at(vars(contains('ims_')) := lag(vars(contains('ims_')))) # but append '_lagged' to the name
Since dplyr 1.0.0, the scoped verbs (_at, _all, _if) are superseded; in their place we can use the more flexible across. If we don't specify .names, the modified values replace the original columns under the same names. When .names is specified, {.col} stands for the original column name, and a prefix or suffix can be added to it as a string.
library(dplyr)
df <- df %>%
mutate(across(starts_with('ims'), lag, .names = "{.col}_lagged"))
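A quick way to check that the originals are kept and the lagged copies are appended is to try it on a tiny frame. This is a minimal sketch with made-up values (df_small is an assumption, not the data from the question). It calls dplyr::lag explicitly so that stats::lag cannot be picked up by accident.

```r
library(dplyr)

# Hypothetical five-row frame with fixed values so the result is easy to read
df_small <- data.frame(
  date  = seq(as.Date("2021-01-01"), as.Date("2021-01-05"), by = "day"),
  ims_x = c(1, 2, 3, 4, 5),
  ims_y = c(10, 20, 30, 40, 50),
  blah  = 1:5
)

# Lag every column whose name starts with "ims_" and keep the originals
lagged <- df_small %>%
  mutate(across(starts_with("ims_"), dplyr::lag, .names = "{.col}_lagged"))

names(lagged)
lagged$ims_x_lagged # first value is NA, the rest are shifted down by one row
```

The blah column is untouched because it does not match the selection, and the first row of each lagged column is NA since it has no preceding row.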
race <- data.frame (id = c('xxxxx','yyyyy','zzzzz','ppppp','qqqqq','rrrrr'),
height = c(180,195,165,172,170,181),
weight = c(75,80,60,75,75,80),
bodytype = c('M','L','M','S','M','L'),
country = c('US','CA','MX','MX','AG','US'),
speed = c(100,120,110,95,100,120),
best_id = c('aaaaa','bbbbb','ccccc','ccccc','ddddd','aaaaa')
)
race_best <- data.frame (id = c('aaaaa','bbbbb','ccccc','ddddd','eeeee','fffff'),
height = c(185,195,180,175,182,180),
weight = c(72,79,70,65,68,71),
bodytype = c('M','M','M','S','L','L')
)
race_updated <- data.frame (id = c('xxxxx','yyyyy','zzzzz','ppppp','qqqqq','rrrrr'),
height = c(180,195,165,172,170,181),
weight = c(75,80,60,75,75,80),
bodytype = c('M','L','M','S','M','L'),
country = c('US','CA','MX','MX','AG','US'),
speed = c(100,120,110,95,100,120),
best_id = c('aaaaa','bbbbb','ccccc','ccccc','ddddd','aaaaa'),
best_id_height = c(185,195,180,180,175,185),
best_id_weight = c(72,79,70,70,65,72),
best_id_bodytype = c('M','M','M','M','S','M')
)
I have a dataframe named "race" with a few variables that describe the characteristics of each racer (height, weight, etc.). The variable id is the unique identifier of the racer. There is also a variable called best_id, which is the id of the racer with the previous best speed (in the category the current racer is in).
To explain better, I have added two more datasets.
Dataset race is the original.
Dataset race_best holds the best racers.
Dataset race_updated is what I want to achieve: the original dataset (race) plus new variables that describe the characteristics of the best racer, e.g. best_id_height is the height of the racer corresponding to best_id, and so on.
It would be great if someone can help me with the problem.
library(dplyr)
race_updated <- left_join(race, race_best, by = c("best_id" = "id") )
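Because height, weight and bodytype exist in both tables, a plain left_join marks the clashing columns with the default .x/.y suffixes instead of the best_id_ prefix asked for. One possible way to get the requested names is to rename the lookup columns before joining. This is a sketch on a cut-down, made-up version of the data (three racers, one attribute), not the full tables from the question.

```r
library(dplyr)

# Hypothetical reduced versions of race and race_best
race <- data.frame(
  id      = c("xxxxx", "yyyyy", "zzzzz"),
  height  = c(180, 195, 165),
  best_id = c("aaaaa", "bbbbb", "aaaaa")
)
race_best <- data.frame(
  id     = c("aaaaa", "bbbbb"),
  height = c(185, 195)
)

# Prefix every lookup column except the key, then join on best_id = id
race_updated <- race %>%
  left_join(
    race_best %>% rename_with(~ paste0("best_id_", .x), .cols = -id),
    by = c("best_id" = "id")
  )

race_updated
```

Renaming first means no suffixes are needed, and the same call scales to any number of attribute columns in race_best.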
An option using lookup join in data.table so that your new columns are also named accordingly:
library(data.table)
# Attribute columns to bring over; the join key id is excluded
cols <- setdiff(names(race_best), "id")
setDT(race)[setDT(race_best), on=.(best_id=id),
paste0("best_id_", cols) := mget(paste0("i.", cols))]
race
I have an R dataframe that has N rows and 6 columns. For illustration, I will use the following column names: "theDate","theIndex","Component_1","Component_2","Component_3","Component_4"
I am trying to convert it to a 3 dimensional array, with first dimension corresponding to "theDate", second dimension to "theIndex" and third dimension to the values of the components.
To give an example, the expression NewArray[2,4,3] should return the value of Component_3 on the row whose "theDate" is the 2nd distinct value of that column and whose "theIndex" is the 4th distinct value of that column.
I have looked into using abind, narray, and a combination of apply/split/abind, without full success.
The closest question I found on SO is this one: Link SO, but I could not generalize it along same lines as the answer found there.
The desired multidimensional array has dimensions (5, 7, 4). The first two dimensions correspond to the 5 distinct elements in the "theDate" column and the 7 distinct elements in the "theIndex" column, while the third dimension corresponds to the 4 additional columns in the dataframe (Component_1,...,Component_4).
Here is a small piece of code to create the dataframe, and to create an empty multidimensional array of desired dimensions
EDIT: I have also added a piece of code which appears to work, and I would be interested in other solutions
`%>%` <- dplyr::`%>%`
base::set.seed(seed = 1785)
setOfComponents <-c("Component_1","Component_2","Component_3","Component_4")
setOfDates <- c(234, 342, 456, 678, 874)
setOfIndices <- c(2, 7, 11, 15, 24, 36, 56)
numIndices <- length(setOfIndices)
numDates <- length(setOfDates)
numElementsComponent <- numIndices * numDates
theDF <- base::data.frame(
theDate = c(base::rep(x = setOfDates[1],times = numIndices),
base::rep(x = setOfDates[2],times = numIndices),
base::rep(x = setOfDates[3],times = numIndices),
base::rep(x = setOfDates[4],times = numIndices),
base::rep(x = setOfDates[5],times = numIndices)),
theIndex = base::rep(x = setOfIndices,times = numDates),
Component_1 = stats::runif(n = numElementsComponent, min = 0, max = 100),
Component_2 = stats::runif(n = numElementsComponent, min = 0, max = 100),
Component_3 = stats::runif(n = numElementsComponent, min = 0, max = 100),
Component_4 = stats::runif(n = numElementsComponent, min = 0, max = 100) )
theNewDF <- theDF %>%
tidyr::gather(key = "IdxComp", value = "ValueComp", Component_1, Component_2, Component_3, Component_4)
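# array() fills the first dimension fastest, but ValueComp has theIndex varying
# fastest, so fill as (index, date, component) and then swap the first two dimensions.
newArray <- aperm(
array(theNewDF$ValueComp, dim = c(numIndices, numDates, length(setOfComponents))),
perm = c(2, 1, 3))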
Check out the tidyr package.
I think you want the gather function (superseded by pivot_longer in tidyr 1.0.0 and later).
See the package, or the descriptions here:
http://www.cookbook-r.com/Manipulating_data/Converting_data_between_wide_and_long_format/
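Note that array() fills the first dimension fastest, so building the array straight from a stacked value column only gives the right cells if the rows happen to be ordered to match. A sketch that does not depend on row order is to look up each row's position with match() and assign through matrix indexing. The small data frame below uses made-up values (2 dates, 3 indices, 2 components) purely for illustration.

```r
# Hypothetical reduced data: index varies fastest within each date
theDF <- data.frame(
  theDate     = rep(c(234, 342), each = 3),
  theIndex    = rep(c(2, 7, 11), times = 2),
  Component_1 = 1:6,
  Component_2 = 11:16
)

dates      <- sort(unique(theDF$theDate))
indices    <- sort(unique(theDF$theIndex))
components <- c("Component_1", "Component_2")

# Empty array of the desired shape, labelled so cells are easy to inspect
newArray <- array(
  NA_real_,
  dim = c(length(dates), length(indices), length(components)),
  dimnames = list(
    theDate   = as.character(dates),
    theIndex  = as.character(indices),
    component = components
  )
)

# Position of every row along the first two dimensions
rowPos <- match(theDF$theDate, dates)
colPos <- match(theDF$theIndex, indices)

# Fill one component slice at a time using (date, index, component) triples
for (k in seq_along(components)) {
  newArray[cbind(rowPos, colPos, k)] <- theDF[[components[k]]]
}

newArray[2, 3, 1] # Component_1 for the 2nd date and the 3rd index
```

Any date/index combination missing from the data frame simply stays NA, which can be easier to debug than a silently shifted array.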
However, it says "undefined columns selected." I am really bad at R and I am trying to select the 1st column. Column 0 is the auto-generated row number.
cd <- data.frame(combineData$DogName)
figPath = system.file("dog.png",package = "wordcloud2")
wordcloud2(data = cd[,"combineData.DogName"], figPath = figPath, size = 1.5,color = "skyblue")
I've tried removing column 0 and tried a lot of other ways. Any suggestions?
Two comments:
"Column 0 is the auto-generated row number": there is no column 0 in an R data.frame; the row labels are row names, not a column.
I don't understand the purpose of cd, which contains only a single column from combineData.
From ?wordcloud2
data A data frame including word and freq in each column
So the data argument of wordcloud2 needs to be a data.frame with two columns.
Here is a minimal reproducible example based on some sample data
library(wordcloud2)
set.seed(2018)
df <- aggregate(
freq ~ word,
data = data.frame(
word = sample(letters[1:10], 100, replace = TRUE),
freq = sample(10:100, 100, replace = TRUE)),
FUN = sum)
wordcloud2(df)
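If the starting point is a single column of names, as with combineData$DogName in the question, one way to get the two-column word/freq shape is to count the names with table(). This is a sketch only: the dogNames vector below is made up, and the final wordcloud2 call is left as a comment.

```r
# Hypothetical stand-in for combineData$DogName
dogNames <- c("Max", "Bella", "Max", "Rocky", "Bella", "Max")

# Count each name and convert to a data frame with columns word and freq
freqDF <- as.data.frame(
  table(word = dogNames),
  responseName = "freq",
  stringsAsFactors = FALSE
)

freqDF

# The counted data frame can then be passed on, e.g.:
# wordcloud2::wordcloud2(freqDF)
```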
I have a data table with 3 columns: Field1, Field2 and Value.
For each attribute in Field2, I want to find the attribute in Field1 that corresponds to the largest sum of Value (i.e. there are multiple rows per Field1 / Field2 pair in the data table).
When I try this: x[,.(Field1 = Field1[which.max(sum(Value))]),.(Field2)] I seem to be getting the first Field1 row for each Field2 rather than the row corresponding to the max sum of Value.
As an extension, what if you wanted to return the sum of Value, the total number of rows, and the Field1 value corresponding to the largest sum of Value within each Field2?
Below is a reproducible example.
library(data.table)
#Set random seed
set.seed(2017)
#Create a table with the attributes we need
x = data.table(rbind(data.frame(Field1 = 1:12,Field2 = rep(1:3, each = 4), Value = runif(12)),
data.frame(Field1 = 1:12,Field2 = rep(1:3, each = 4), Value = runif(12))))
#Let's order by Field2/ Field1 / Value
x = x[order(Field2,Field1,Value)]
#Check
print(x)
# This works, but requires 2 steps which can complicate things when needing
# to pull other attributes too.
(x[,.(Value = sum(Value)),.(Field2,Field1)][,.SD[which.max(Value)],.(Field2)])
#This instead provides the row corresponding to the largest Value.
(x[,.(Field1 = Field1[which.max(Value)]),.(Field2)])
# This is what I was ideally looking for but it only returns the first row of the attribute
# regardless of the value of Value, or the corresponding sum.
(x[,.(Field1 = Field1[which.max(sum(Value))]),.(Field2)])
# This works but seems clumsy
(x[,
.SD[, .(RKCNT=length(.I),TotalValue=sum(Value)), .(Field1)]
[,.(RKCNT = sum(RKCNT), TotalValue = sum(TotalValue),
Field1 = Field1[which.max(TotalValue)])],
.(Field2)])
We can use
x[, .SD[, sum(Value), Field1][which.max(V1)], Field2]
This is concise and thus somewhat easier to read, but it does not give any performance improvement.
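The attempt with which.max(sum(Value)) returns the first row because sum() collapses the group to a single number, so which.max() can only ever return 1. For the extension (sum, row count and best Field1 per Field2), one possible sketch is to aggregate per Field2/Field1 pair first and then summarise per Field2. The table below uses made-up values, and the output names N and GroupTotal are illustrative choices.

```r
library(data.table)

# Hypothetical small table: two rows per Field1 value
x <- data.table(
  Field1 = c(1, 1, 2, 2, 3, 3),
  Field2 = c(1, 1, 1, 1, 2, 2),
  Value  = c(0.1, 0.2, 0.5, 0.4, 0.3, 0.3)
)

# sum() yields one number, so which.max() of it is always 1
which.max(sum(x$Value))

# Step 1: sum and row count per (Field2, Field1) pair
# Step 2: per Field2, total rows, total value, and the Field1 with the largest sum
res <- x[, .(TotalValue = sum(Value), N = .N), by = .(Field2, Field1)][
  , .(RKCNT      = sum(N),
      GroupTotal = sum(TotalValue),
      Field1     = Field1[which.max(TotalValue)]),
  by = Field2]

res
```

Chaining the two steps avoids the nested .SD call while keeping the per-pair sums available for any other attribute that needs to be pulled.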