A data.frame df1 is fuzzy-matched against another data.frame df2 with agrep. By iterating over the output (a list called matches that holds the row numbers of the matching rows in df2), df1 is populated with the corresponding values from df2.
The goal is a function that is passed to mapply; however, in all my attempts df1 remains unchanged.
In a for-loop, the code works as expected and populates df1 with the affiliated variables from df2. Still, I would be interested how to solve this with a function that is passed to mapply.
First, the two data.frames:
df1 <- structure(list(Species = c("Alisma plantago-aquatica", "Alnus glutinosa",
"Carex davalliana", "Carex echinata",
"Carex elata"),
CheckPoint = c(NA, NA, NA, NA, NA),
L = c(NA, NA, NA, NA, NA),
R = c(NA, NA, NA, NA, NA),
K = c(NA, NA, NA, NA, NA)),
row.names = c(NA, 5L), class = "data.frame")
df2 <- structure(list(Species = c("Alisma gramineum", "Alisma lanceolatum",
"Alisma plantago-aquatica", "Alnus glutinosa",
"Alnus incana", "Alnus viridis",
"Carex davalliana", "Carex depauperata",
"Carex diandra", "Carex digitata",
"Carex dioica", "Carex distans",
"Carex disticha", "Carex echinata",
"Carex elata"),
L = c(7L, 7L, 7L, 5L, 6L, 7L, 9L, 4L, 8L, 3L, 9L, 9L, 8L,
8L, 8L),
R = c(7L, 7L, 5L, 5L, 4L, 3L, 4L, 7L, 6L, NA, 4L, 6L, 6L,
NA, NA),
K = c(6L, 2L, NA, 3L, 5L, 4L, 4L, 2L, 7L, 4L, NA, 3L, NA,
3L, 2L)),
row.names = seq(1:15), class = "data.frame")
Then, fuzzy match by Species:
matches <- lapply(df1$Species, agrep, x = df2$Species, value = FALSE,
max.distance = c(deletions = 0,
insertions = 1,
substitutions = 1))
Populating df1 with the values from df2 works as expected:
for (i in 1:dim(df1)[1]){
df1[i, 2:5] <- df2[matches[[i]], ]
}
My mapply approach, in contrast, does return the correct values, but only as a list of disassembled values that are never written into df1. No variation (with or without return(df1), assigning the result to another variable, or desperate attempts with different settings of SIMPLIFY or USE.NAMES) yielded the desired result.
populatedf1 <- function(matches, index){
df1[index, 2:5] <- df2[matches, ]
#return(df1)
}
mapply(populatedf1, matches, seq_along(matches), SIMPLIFY = FALSE,
USE.NAMES = FALSE)
Would be great if someone knows the solution or could point me into a certain direction, thanks! :)
Actually, you would not need any loop here (for or mapply) if you replace lapply with sapply (so that it returns a vector instead of list) and then do a direct assignment.
matches <- sapply(df1$Species, agrep, x = df2$Species, value = FALSE,
max.distance = c(deletions = 0,
insertions = 1,
substitutions = 1))
df1[, 2:5] <- df2[matches,]
df1
# Species CheckPoint L R K
#1 Alisma plantago-aquatica Alisma plantago-aquatica 7 5 NA
#2 Alnus glutinosa Alnus glutinosa 5 5 3
#3 Carex davalliana Carex davalliana 9 4 4
#4 Carex echinata Carex echinata 8 NA 3
#5 Carex elata Carex elata 8 NA 2
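One caveat, shown here as a hedged sketch: sapply only simplifies to a vector when every pattern yields exactly one match. If a pattern matches several rows (or none), sapply returns a list instead and the direct assignment no longer lines up. The toy example below uses grep rather than agrep, and the pool vector is made up purely to illustrate the simplification behaviour.

```r
# Hypothetical lookup vector, not the df2 from the question
pool <- c("Alnus glutinosa", "Alnus incana", "Carex elata")

# Every pattern matches exactly one element: result is a named integer vector
one_each <- sapply(c("Carex", "incana"), grep, x = pool)

# "Alnus" matches two elements: result falls back to a list
several <- sapply(c("Alnus", "Carex"), grep, x = pool)

is.integer(one_each)  # TRUE
is.list(several)      # TRUE
```

Checking the result with is.list() (or using vapply with integer(1), which fails loudly instead) before the assignment is one way to guard against this.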
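As far as your approach is concerned: the assignment inside populatedf1 only changes a copy of df1 that is local to the function, so the global df1 is never touched. You can use Map (or mapply with SIMPLIFY = FALSE), bring the list of one-row dataframes into one dataframe using do.call and rbind, and then assign the result.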
df1[, 2:5] <- do.call(rbind, Map(populatedf1, matches, seq_along(matches)))
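The scoping behaviour behind this can be sketched with a small stand-in example. The names target, lookup and fill_row are hypothetical and not part of the question; the point is only that an assignment inside a function works on a local copy, so the function has to return its result and the caller has to assign it.

```r
# Hypothetical stand-in data, much smaller than df1/df2
target <- data.frame(id = 1:3, value = NA_integer_)
lookup <- data.frame(value = c(10L, 20L, 30L))

fill_row <- function(i) {
  target[i, "value"] <- lookup$value[i]  # modifies a function-local copy only
  target[i, ]                            # return the filled row explicitly
}

rows <- lapply(seq_len(nrow(target)), fill_row)

# The global object is still untouched at this point
all(is.na(target$value))  # TRUE

# Bind the returned rows and assign the result back
target <- do.call(rbind, rows)
target$value  # 10 20 30
```

Using <<- inside the function would also modify the global object, but returning values and assigning once is usually easier to reason about.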
I have a dataframe with 3 columns: ID, F and M. I want to join the values for F and M into one row based on ID, while now most of them are in two separate rows with NAs instead.
Unfortunately, I also have some duplicate rows, and the data is still a bit messy at the moment (see the example below).
I tried this, but I get Error: Expecting a single value: [extent=2].
test2 <- test %>% mutate(grouped_id = row_number()) %>%
group_by(BroodID) %>%
summarise_each(funs(na.omit))
Here is a reproducible example of what my data looks like:
structure(list(ID = c(2010.3, 2010.3, 2010.3, 2010.3, 2010.33,
2010.34, 2010.38, 2010.38, 2010.39, 2010.39, 2010.4, 2010.4,
2010.4, 2010.4, 2010.4, 2010.41, 2010.41, 2010.42, 2010.42, 2010.44,
2010.44, 2010.46, 2010.46), F = structure(c(5L, 5L, 12L, 12L,
11L, 8L, NA, 3L, NA, 1L, NA, 2L, 2L, 6L, 6L, NA, 7L, NA, 9L,
NA, 4L, NA, 10L), .Label = c("T206434", "T206553", "T931169",
"T931286", "T961275", "V470937", "X250041", "X250109", "X250195",
"X250568", "X251067", "X251069"), class = "factor"), M = structure(c(2L,
2L, 11L, 11L, 6L, NA, 9L, NA, 10L, NA, 1L, 1L, 4L, 4L, NA, 8L,
NA, 3L, NA, 7L, NA, 5L, NA), .Label = c("T206824", "T206994",
"T960191", "T961486", "X250567", "X250779", "X250851", "X251046",
"X251066", "X251074", "X251116"), class = "factor")), row.names = c(NA,
23L), class = "data.frame")
I'd like the rows where the values for F and M are split into two rows to be merged into one row based on ID.
We could use unite with na.rm = TRUE to remove NA values and use distinct to have only unique rows.
library(dplyr)
test %>%
mutate_at(2:3, as.character) %>%
tidyr::unite(combined, F, M, na.rm = TRUE, sep = ",") %>%
distinct()
# ID combined
#1 2010.30 T961275,T206994
#2 2010.30 X251069,X251116
#3 2010.33 X251067,X250779
#4 2010.34 X250109
#5 2010.38 X251066
#6 2010.38 T931169
#7 2010.39 X251074
#8 2010.39 T206434
#9 2010.40 T206824
#10 2010.40 T206553,T206824
#11 2010.40 T206553,T961486
#12 2010.40 V470937,T961486
#13 2010.40 V470937
#14 2010.41 X251046
#15 2010.41 X250041
#16 2010.42 T960191
#17 2010.42 X250195
#18 2010.44 X250851
#19 2010.44 T931286
#20 2010.46 X250567
#21 2010.46 X250568
If we want to further summarise by ID, we can do
test %>%
mutate_at(2:3, as.character) %>%
tidyr::unite(combined, F, M, na.rm = TRUE, sep = ",") %>%
distinct() %>%
group_by(ID) %>%
summarise(combined = toString(combined))
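For readers who want to see what the unite(..., na.rm = TRUE) step does, here is a rough base R sketch on toy data. The helper combine_non_na and the toy data frame are assumptions made for illustration; this is not how tidyr implements it.

```r
# Hypothetical helper: paste the non-NA values of each row together
combine_non_na <- function(...) {
  parts <- cbind(...)
  apply(parts, 1, function(r) paste(r[!is.na(r)], collapse = ","))
}

# Toy data with the same shape as the question (ID, F, M)
toy <- data.frame(ID = c(1, 1, 2),
                  F  = c("a", NA, "c"),
                  M  = c(NA, "b", "d"),
                  stringsAsFactors = FALSE)

toy$combined <- combine_non_na(toy$F, toy$M)
toy$combined  # "a" "b" "c,d"

# Collapse to one row per ID, analogous to the summarise() step
per_id <- aggregate(combined ~ ID, data = toy, FUN = toString)
per_id$combined  # "a, b" "c,d"
```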
I have a survey dataset that I imported as an SAS file but it did not include the text labels that are associated with the numeric codes in the dataset.
I'm trying to apply the factor function to all variables and then have the respective levels and labels for each variable.
I have a main dataframe with the actual data, and then a second dataframe with the text labels corresponding to each value for each variable.
So, for example, the variable column names in the main dataset are A1, B1, C1, D1. The second dataframe with the labels is listed below with dummy text. And for each variable, there are varying numbers of values that need text labels.
labels_list <- structure(list(VariableName = c("A1", "A1", "A1", "B1", "B1",
"B1", "B1", "C1", "C1", "C1", "C1", "C1", "D1", "D1", "D1", "D1",
"D1", "D1"), Value = c(1L, 2L, 3L, 1L, 2L, 3L, 4L, 1L, 2L, 3L,
4L, 5L, 1L, 2L, 3L, 4L, 5L, 6L), Label = c("Red", "Blue", "Yellow",
"Up", "Down", "Left", "Right", "Boston", "Atlanta", "Dallas",
"New York", "Los Angeles", "John", "Jim", "Jake", "Bill", "Bob",
"Brian")), class = "data.frame", row.names = c(NA, -18L))
I'm trying to write a function to automatically label all the factor variables. The function first reduces both datasets so that they contain exactly the same variables in exactly the same order. I split the table above into a list using the split function, so that each variable name above has its own list element, but I'm encountering an error when I try to subset the list in the for loop.
Below is the for loop I have written.
# df = main dataset
# labels_list = list with the value and text labels
for(i in 1:ncol(df)) {
for(j in labels_list) {
if(names(x[,i]) == names(ahs_split[[j]])) {
x[,i] <- factor(x[,i], levels = c(ahs_split[[j]][[2]]), labels = c(ahs_split[[j]][[3]]))
}
}
}
As I mentioned, my ultimate goal is to take this dataframe with the text labels and corresponding values for each variable and apply it to each one individually using the factor function. I've tried for almost a month now and am just very stuck so I could use any help. I'm not sure if anyone could possibly recommend a better approach or point me in the right direction. I would greatly appreciate any help.
If you don't mind some tidyverse verbs, you can reshape your data with tidyr::gather. Once it's in a long shape, you can join the data with the code lookup by variable name, and reshape it back into a wide format. This workflow scales for however many columns you need.
library(dplyr)
library(tidyr)
labels_list <- structure(list(Variable = structure(c(1L, 1L, 1L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L), .Label = c("A1",
"B1", "C1", "D1"), class = "factor"), Value = c(1L, 2L, 3L, 1L,
2L, 3L, 4L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 6L), Label = structure(c(15L,
3L, 18L, 17L, 8L, 12L, 16L, 5L, 1L, 7L, 14L, 13L, 11L, 10L, 9L,
2L, 4L, 6L), .Label = c("Atlanta", "Bill", "Blue", "Bob", "Boston",
"Brian", "Dallas", "Down", "Jake", "Jim", "John", "Left", "Los_Angeles",
"New_York", "Red", "Right", "Up", "Yellow"), class = "factor")), class = "data.frame", row.names = c(NA,
-18L))
df <- tibble(A1 = rep(1:3,2),
B1 = c(1:4, 1, 2),
C1 = c(1:5, 1),
D1 = 1:6
)
A row number within each Variable group is needed so that spread can line the values up again, but you can drop it once it's no longer needed.
df %>%
gather(key = Variable, value = Value) %>%
left_join(labels_list, by = c("Variable", "Value")) %>%
select(-Value) %>%
group_by(Variable) %>%
mutate(row = row_number()) %>%
spread(key = Variable, value = Label)
#> Warning: Column `Variable` joining character vector and factor, coercing
#> into character vector
#> # A tibble: 6 x 5
#> row A1 B1 C1 D1
#> <int> <fct> <fct> <fct> <fct>
#> 1 1 Red Up Boston John
#> 2 2 Blue Down Atlanta Jim
#> 3 3 Yellow Left Dallas Jake
#> 4 4 Red Right New_York Bill
#> 5 5 Blue Up Los_Angeles Bob
#> 6 6 Yellow Down Boston Brian
One way is to convert your labels_list into a list of lists:
library(dplyr) # just using dplyr for the pipe %>%, otherwise everything is in base R
# Convert df to list of key:value pairs
labels_list <- labels_list %>%
split(f = labels_list$VariableName) %>%
lapply(function(x) list(key = x$Value, value = x$Label))
e.g.:
$A1
$A1$key
[1] 1 2 3
$A1$value
[1] "Red" "Blue" "Yellow"
This can be mapped onto your df col-wise with apply. This is a bit hacky as I put the column name as the first item of the vector passed to the function.
# Map labels onto sample data with factor()
apply(rbind(names(df), df),
2,
function(x) factor(x[2:length(x)],
levels = labels_list[[x[1]]]$key,
labels = labels_list[[x[1]]]$value)) %>%
as.data.frame()
A1 B1 C1 D1
1 Blue Up Dallas Jake
2 Red Down New York Jake
3 Yellow Left Boston Jim
4 Yellow Right Boston John
5 Yellow Down Los Angeles Jake
6 Red Left Atlanta Jake
7 Blue Down New York John
8 Red Down Atlanta Brian
9 Blue Up New York Jim
10 Yellow Down Atlanta Bill
Sample Data
set.seed(1724)
df <- data.frame(A1 = floor(runif(10, 1, 4)),
B1 = floor(runif(10, 1, 5)),
C1 = floor(runif(10, 1, 6)),
D1 = floor(runif(10, 1, 7)))
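For comparison, the loop from the question can also be sketched in plain base R by indexing the split lookup by column name instead of by position. The objects lookup, survey and by_var below are small made-up stand-ins; only the column layout (VariableName, Value, Label) is taken from the question.

```r
# Toy lookup table with the same columns as labels_list in the question
lookup <- data.frame(
  VariableName = c("A1", "A1", "A1", "B1", "B1"),
  Value = c(1L, 2L, 3L, 1L, 2L),
  Label = c("Red", "Blue", "Yellow", "Up", "Down"),
  stringsAsFactors = FALSE
)

# Toy main dataset holding numeric codes
survey <- data.frame(A1 = c(3, 1, 2), B1 = c(2, 2, 1))

# One lookup data frame per variable, named by variable
by_var <- split(lookup, lookup$VariableName)

# Only touch columns that actually have a lookup entry
for (nm in intersect(names(survey), names(by_var))) {
  survey[[nm]] <- factor(survey[[nm]],
                         levels = by_var[[nm]]$Value,
                         labels = by_var[[nm]]$Label)
}

as.character(survey$A1)  # "Yellow" "Red" "Blue"
as.character(survey$B1)  # "Down" "Down" "Up"
```

Looping over names rather than over 1:ncol(df) avoids comparing names inside a nested loop, and columns without labels are simply left as they are.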
I am currently experiencing a problem where I have a long dataframe (i.e., multiple rows per subject) and want to remove cases that don't have any measurements (in any of the rows) on one variable. I've tried transforming the data to wide format, but this was a problem as I can't go back anymore (going from long to wide "destroys" my timeline variable). Does anyone have an idea about how to fix this problem?
Below is some code to simulate the head of my data. Specifically, I want to remove cases that don't have a measurement of extraversion on any of the measurement occasions ("time").
structure(list(id = c(1L, 1L, 2L, 3L, 3L, 3L), time = c(79L, 95L, 79L, 28L, 40L, 52L),
extraversion = c(3.2, NA, NA, 2, 2.4, NA), satisfaction = c(3L, 3L, 4L, 5L, 5L, 9L),
`self-esteem` = c(4.9, NA, NA, 6.9, 6.7, NA)), .Names = c("id", "time", "extraversion",
"satisfaction", "self-esteem"), row.names = c(NA, 6L), class = "data.frame")
Note: I realise that the missing values of my extraversion variable coincide with those of my self-esteem variable.
To drop an entire id if they don't have any measurements for extraversion you could do:
library(data.table)
setDT(df)[, drop := all(is.na(extraversion)) ,by= id][!df$drop]
# id time extraversion satisfaction self-esteem drop
#1: 1 79 3.2 3 4.9 FALSE
#2: 1 95 NA 3 NA FALSE
#3: 3 28 2.0 5 6.9 FALSE
#4: 3 40 2.4 5 6.7 FALSE
#5: 3 52 NA 9 NA FALSE
Or you could use .I which I believe should be faster:
setDT(df)[df[,.I[!all(is.na(extraversion))], by = id]$V1]
Lastly, a base R solution could use ave (thanks to @thelatemail for the suggestion to make it shorter/more expressive):
df[!ave(is.na(df$extraversion), df$id, FUN = all),]
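To make the ave idea a bit more tangible, here is a small sketch on toy data (the toy data frame is made up, with the same id and extraversion columns as in the question). ave computes all(is.na(...)) once per id and then repeats that group-level result for every row of the group, which yields a row-wise filter.

```r
# Toy data: id 2 has no extraversion measurement at all
toy <- data.frame(id = c(1L, 1L, 2L, 3L),
                  extraversion = c(3.2, NA, NA, 2))

# TRUE for every row whose id has only missing extraversion values
all_missing <- ave(is.na(toy$extraversion), toy$id, FUN = all)
all_missing  # FALSE FALSE TRUE FALSE

# Keep the ids with at least one measurement, including their NA rows
kept <- toy[!all_missing, ]
kept$id  # 1 1 3
```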
Assuming the data frame is named mydata, use a dplyr filter:
library(dplyr)
mydata %>%
group_by(id) %>%
filter(!all(is.na(extraversion))) %>%
ungroup()
d <-
structure(
list(
id = c(1L, 1L, 2L, 3L, 3L, 3L),
time = c(79L, 95L, 79L, 28L, 40L, 52L),
extraversion = c(3.2, NA, NA, 2, 2.4, NA),
satisfaction = c(3L, 3L, 4L, 5L, 5L, 9L),
`self-esteem` = c(4.9, NA, NA, 6.9, 6.7, NA)
),
.Names = c("id", "time", "extraversion",
"satisfaction", "self-esteem"),
row.names = c(NA, 6L),
class = "data.frame"
)
d[complete.cases(d$extraversion), ]
d[is.na(d$extraversion), ]
complete.cases is useful if you want to remove every row that has missing data in any column: d[complete.cases(d), ]. Note that it works row by row, so it does not drop a whole id.
"f","index","values","lo.80","lo.95","hi.80","hi.95"
"auto.arima",2017-07-31 16:40:00,2.81613884762163,NA,NA,NA,NA
"auto.arima",2017-07-31 16:40:10,2.83441637197378,NA,NA,NA,NA
"auto.arima",2017-07-31 20:39:10,3.18497899649267,2.73259824384436,2.49312233904087,3.63735974914098,3.87683565394447
"auto.arima",2017-07-31 20:39:20,3.16981166809297,2.69309866988864,2.44074205235297,3.64652466629731,3.89888128383297
"ets",2017-07-31 16:40:00,2.93983529828936,NA,NA,NA,NA
"ets",2017-07-31 16:40:10,3.09739640066054,NA,NA,NA,NA
"ets",2017-07-31 20:39:10,3.1951571771414,2.80966705285567,2.60560090776504,3.58064730142714,3.78471344651776
"ets",2017-07-31 20:39:20,3.33876776870274,2.93593322313957,2.72268549604222,3.7416023142659,3.95485004136325
"bats",2017-07-31 16:40:00,2.82795253090081,NA,NA,NA,NA
"bats",2017-07-31 16:40:10,2.96389759682623,NA,NA,NA,NA
"bats",2017-07-31 20:39:10,3.1383560278272,2.76890864400062,2.573335012715,3.50780341165378,3.7033770429394
"bats",2017-07-31 20:39:20,3.3561357998535,2.98646195085452,2.79076843614824,3.72580964885248,3.92150316355876
I have a dataframe like above which has column names as:"f","index","values","lo.80","lo.95","hi.80","hi.95".
What I want to do is calculate the weighted average of the forecast results from the different models for a particular timestamp. By this I mean:
For every row in auto.arima there is a corresponding row in ets and bats with the same timestamp value, so the weighted average should be calculated like this:
values_arima*1/3 + values_ets*1/3 + values_bats*1/3; values for lo.80 and the other columns should be calculated similarly.
This result should be stored in a new dataframe with all the weighted average values.
The new dataframe can look something like:
index (timestamp from the above dataframe),avg,avg_lo_80,avg_lo_95,avg_hi_80,avg_hi_95
I think I need to use the spread() and mutate() functions to achieve this. Being new to R, I'm unable to proceed after forming this dataframe.
Please help.
The example you provide is not a weighted average but a simple average.
What you want is a simple aggregate.
The first part is your dataset as provided by dput (better for sharing here)
d <- structure(list(f = structure(c(1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L,
2L, 2L, 2L, 2L), .Label = c("auto.arima", "bats", "ets"), class = "factor"),
index = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L,
3L, 4L), .Label = c("2017-07-31 16:40:00", "2017-07-31 16:40:10",
"2017-07-31 20:39:10", "2017-07-31 20:39:20"), class = "factor"),
values = c(2.81613884762163, 2.83441637197378, 3.18497899649267,
3.16981166809297, 2.93983529828936, 3.09739640066054, 3.1951571771414,
3.33876776870274, 2.82795253090081, 2.96389759682623, 3.1383560278272,
3.3561357998535), lo.80 = c(NA, NA, 2.73259824384436, 2.69309866988864,
NA, NA, 2.80966705285567, 2.93593322313957, NA, NA, 2.76890864400062,
2.98646195085452), lo.95 = c(NA, NA, 2.49312233904087, 2.44074205235297,
NA, NA, 2.60560090776504, 2.72268549604222, NA, NA, 2.573335012715,
2.79076843614824), hi.80 = c(NA, NA, 3.63735974914098, 3.64652466629731,
NA, NA, 3.58064730142714, 3.7416023142659, NA, NA, 3.50780341165378,
3.72580964885248), hi.95 = c(NA, NA, 3.87683565394447, 3.89888128383297,
NA, NA, 3.78471344651776, 3.95485004136325, NA, NA, 3.7033770429394,
3.92150316355876)), .Names = c("f", "index", "values", "lo.80",
"lo.95", "hi.80", "hi.95"), class = "data.frame", row.names = c(NA,
-12L))
> aggregate(d[,3:7], by = d["index"], FUN = mean)
index values lo.80 lo.95 hi.80 hi.95
1 2017-07-31 16:40:00 2.861309 NA NA NA NA
2 2017-07-31 16:40:10 2.965237 NA NA NA NA
3 2017-07-31 20:39:10 3.172831 2.770391 2.557353 3.575270 3.788309
4 2017-07-31 20:39:20 3.288238 2.871831 2.651399 3.704646 3.925078
You can save this output in an object and change the column names as you want.
If you really want a weighted average, this is a way to obtain it (here bats has a weight of 0.8 and the two others 0.1 each):
> d$weight <- (d$f)
> levels(d$weight) # check the levels
[1] "auto.arima" "bats" "ets"
> levels(d$weight) <- c(0.1, 0.8, 0.1)
> # transform the factor into numbers
> # warning: as.numeric(d$weight) alone is not correct, it would return the internal level codes instead of the weights!
> d$weight <- as.numeric(as.character((d$weight)))
>
> # Here the result is saved in a data.frame called "result"
> result <- aggregate(d[,3:7] * d$weight, by = d["index"], FUN = sum)
> result
index values lo.80 lo.95 hi.80 hi.95
1 2017-07-31 16:40:00 2.837959 NA NA NA NA
2 2017-07-31 16:40:10 2.964299 NA NA NA NA
3 2017-07-31 20:39:10 3.148698 2.769353 2.568540 3.528043 3.728857
4 2017-07-31 20:39:20 3.335767 2.952073 2.748958 3.719460 3.922576
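As an alternative sketch, the weights can be kept in a named vector and looked up by model name, which avoids re-labelling factor levels. The toy data and the names w and result below are made up for illustration; weighted.mean is the base R function from the stats package.

```r
# Toy forecasts: three models, two timestamps
toy <- data.frame(
  f = c("auto.arima", "ets", "bats", "auto.arima", "ets", "bats"),
  index = rep(c("t1", "t2"), each = 3),
  values = c(1, 2, 3, 4, 5, 6),
  stringsAsFactors = FALSE
)

# Assumed weights per model, looked up by name
w <- c(auto.arima = 0.1, ets = 0.1, bats = 0.8)
toy$weight <- unname(w[toy$f])

# Weighted mean of the values per timestamp
result <- sapply(split(toy, toy$index),
                 function(g) weighted.mean(g$values, g$weight))
result  # t1 = 2.7, t2 = 5.7
```

weighted.mean divides by the sum of the weights, so the result stays a proper average even if one model is missing for a timestamp. The same function can be applied to lo.80 and the other columns.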
I'm trying to get the data from column one that matches with column 2 but only on the "B" values. Need to somehow make the true values a list.
Need this to repeat for 50,000 rows. Around 37,000 of them are true.
I'm incredibly new to this so any help would be nice.
Data <- data.frame(
X = sample(1:10),
Y = sample(c("B", "W"), 10, replace = TRUE)
)
Count <- 1
If(data[count,2] == "B") {
List <- list(data[count,1]
Count <- count + 1
#I'm not sure what to use to repeat I just put
Repeat
} else {
Count <- count + 1
Repeat
}
End result should be a list() of only column one data.
In this if rows 1-5 had "B" I want the column one numbers from that.
Not sure if I understood correctly what you're looking for, but from the comments I would assume that this might help:
setNames(data.frame(Data[1][Data[2]=="B"]), "selected")
# selected
#1 2
#2 5
#3 7
#4 6
No loop needed.
data
Data <- structure(list(X = c(10L, 4L, 9L, 8L, 3L, 2L, 5L, 1L, 7L, 6L),
Y = structure(c(2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 1L),
.Label = c("B", "W"), class = "factor")),
.Names = c("X", "Y"), row.names = c(NA, -10L),
class = "data.frame")
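The same selection can be written with plain logical indexing, sketched below on the data shown above (Y is stored as character here rather than as a factor, which does not change the comparison). If an actual list() is required, as.list converts the resulting vector.

```r
# Same values as the example data above
Data <- data.frame(
  X = c(10L, 4L, 9L, 8L, 3L, 2L, 5L, 1L, 7L, 6L),
  Y = c("W", "W", "W", "W", "W", "B", "B", "W", "B", "B"),
  stringsAsFactors = FALSE
)

# Logical vector: TRUE where column Y equals "B"
is_b <- Data$Y == "B"

# Column-one values for those rows, as a vector and as a list
selected <- Data$X[is_b]
selected_list <- as.list(selected)

selected               # 2 5 7 6
length(selected_list)  # 4
```

The comparison is vectorised, so it handles 50,000 rows in a single step without an explicit loop or counter.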