Why does spread() create a NA-only column? - r

I'm still an R beginner, so I hope this question is not redundant, but I couldn't find a satisfying answer to my problem. Although this question seems to be very similar, I still wonder whether my observation represents the standard case. Using the function tidyr::spread results in awkward behaviour when I try to spread a column with three unique values that also contains NAs. The result is a tibble with three new columns (as expected) but also with an additional fourth column named "NA" which is completely filled with NAs.
Here is my example dataframe:
test <- data.frame("Country" = c("A", "A", "A", "A", "A", "A", "A", "A"),
"Column1" = c(1, 1, 1, 1, 1, 1, 2, 2),
"Column2" = c(3, 3, 3, 4, 4, 4, 5, 5),
"Column3" = c("B", "M", "F", "B", "M", "F", "B", NA),
"Column4" = c(50, 74, 31, 53, 79, 33, 51, NA))
test1 <- spread(test, key = "Column3", value = "Column4")
test1
Is this normal when my tibble contains missing values? And if so, why? The creation of an additional column being completely filled with missing values as a standard behaviour seems strange to me. Or am I missing something obvious (probably)?
Any help would be much appreciated!

spread is behaving as expected, though the repeated presence of NA as both a column name and as values in the data frame might make the behavior unclear. Let's change the data frame to use the string 'NA' in "Column3" and a dummy value of 999 in "Column4":
test <- data.frame("Country" = c("A", "A", "A", "A", "A", "A", "A", "A"), "Column1" = c(1, 1, 1, 1, 1, 1, 2, 2), "Column2" = c(3, 3, 3, 4, 4, 4, 5, 5), "Column3" = c("B", "M", "F", "B", "M", "F", "B", 'NA'), "Column4" = c(50, 74, 31, 53, 79, 33, 51, 999))
Country Column1 Column2 Column3 Column4
1 A 1 3 B 50
2 A 1 3 M 74
3 A 1 3 F 31
4 A 1 4 B 53
5 A 1 4 M 79
6 A 1 4 F 33
7 A 2 5 B 51
8 A 2 5 NA 999
And now the spread operation:
test1 <- spread(test, key = "Column3", value = "Column4")
Country Column1 Column2 B F M NA
1 A 1 3 50 31 74 NA
2 A 1 4 53 33 79 NA
3 A 2 5 51 NA NA 999
spread has correctly placed the 999 value in the new "NA" column (again, new column names taken from the old values in "Column3"), and aligned this value with matching values from the original data frame. Because 999 only appears once in the original data frame, it only has 1 matching row in the new data frame, and all other rows in the new "NA" column are therefore filled with NA (again, somewhat confusingly here).
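If the extra column is unwanted, one option is to drop the rows whose key is missing before spreading. The following is only a minimal sketch using the question's original data; it assumes those rows carry no information you need (here the matching "Column4" value is also NA), and the name test_clean is made up for illustration.

```r
library(tidyr)

# Original data from the question, with real NA values in the last row.
test <- data.frame(
  Country = rep("A", 8),
  Column1 = c(1, 1, 1, 1, 1, 1, 2, 2),
  Column2 = c(3, 3, 3, 4, 4, 4, 5, 5),
  Column3 = c("B", "M", "F", "B", "M", "F", "B", NA),
  Column4 = c(50, 74, 31, 53, 79, 33, 51, NA),
  stringsAsFactors = FALSE
)

# Drop rows with a missing key so spread() has no NA key to turn into a column.
# (test_clean is a hypothetical helper name.)
test_clean <- test[!is.na(test$Column3), ]

test1 <- spread(test_clean, key = Column3, value = Column4)
test1
```

The remaining NA cells in the result then only mark combinations that have no row in the input, such as "F" and "M" for Column1 = 2.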

Related

Conditional rolling sum based on another column

I would like to compute the conditional rolling sum of a column, but based on the values of another column.
I have a table like this:
data_frame <- data.frame( category1 = c("A", "A", "A", "B", "B", "B", "A", "A", "B"),
category2 = c("B", "B", "B", "A", "A", "A", "B", "B", "A"),
value = c(1, 2, 1, 2, 1, 5, 3, 4, 2),
desired_output = c(0, 0, 0, 4, 4, 4, 8, 8, 11))
data_frame2 <- data_frame %>%
group_by(category1) %>%
mutate(cumsum = cumsum(value))
category1 category2 value cumsum desired_output
A B 1 1 0
A B 2 3 0
A B 1 4 0
B A 2 2 4
B A 1 3 4
B A 5 8 4
A B 3 7 8
A B 4 11 8
B A 2 10 11
I am able to compute the rolling sum of the value based on category1 or category2 using cumsum, but I would like a column which calculates a rolling sum of the value column when category1 equals the current value of category2. For example, in the last row of the above example it sums the value of all the above rows when category1 == A, as the current value of category2 is A.
I have tried various hacky ifelse/lag/fill solutions but nothing gets close to what I need. I have also tried adding a conditional into the ave function, as below, but not sure what the syntax should be...
data_frame2$desired_output <- ave(data_frame2$value, data_frame2$category1 = data_frame2$category2, FUN=cumsum)
Thanks in advance - first question so apologies about anything I missed/got wrong!
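The rule described above (for each row, sum value over the earlier rows whose category1 equals the current row's category2) can be sketched in base R as follows. This is not an answer taken from the thread, just one straightforward reading of the desired_output column; the name rolling_match_sum is made up, and the loop is quadratic in the number of rows, so it may be slow on large tables.

```r
data_frame <- data.frame(
  category1 = c("A", "A", "A", "B", "B", "B", "A", "A", "B"),
  category2 = c("B", "B", "B", "A", "A", "A", "B", "B", "A"),
  value     = c(1, 2, 1, 2, 1, 5, 3, 4, 2),
  stringsAsFactors = FALSE
)

# For row i, look only at rows 1..(i - 1) and keep those whose category1
# matches the current row's category2, then sum their values.
rolling_match_sum <- vapply(seq_len(nrow(data_frame)), function(i) {
  earlier <- seq_len(i - 1)
  hits <- data_frame$category1[earlier] == data_frame$category2[i]
  sum(data_frame$value[earlier][hits])
}, numeric(1))

data_frame$rolling_match_sum <- rolling_match_sum
data_frame
```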

Look up/match values within the same dataframe column in R

Given data.frame(code=c(10, 20, 21, 22, 23, 31, 32, 40, 50), label=c("a", "b", "c", "d", "e", "f", "g", "h", "i")), I'd like c("", "", "b", "b", "b", "", "", "", "").
If the value is not a multiple of 10, assign the label of the immediately previous multiple of 10 if it is listed. If the immediately previous multiple of 10 is not listed, assign blank. If the value is a multiple of 10, assign blank. (Unlike this dummy example, multiple sequences of non-multiples of 10 may occur in the data and the values may not be ordered.)
Ideally, I'd like to do this as a vector operation in base R, for speed and parsimony.
EDIT: I was trying to simplify my question as much as possible but maybe it was misleading so here is the final output I'm aiming for: data.frame(code=c(10, 20, 21, 22, 23, 31, 32, 40, 50), label=c("a", "b", "b c", "b d", "b e", "f", "g", "h", "i")). That is: prepend the intermediate output to the label column.
This looks like overkill, but it seems to work:
library(dplyr)
library(tidyr)
# Example data from the question
df <- data.frame(code = c(10, 20, 21, 22, 23, 31, 32, 40, 50),
                 label = c("a", "b", "c", "d", "e", "f", "g", "h", "i"))
df %>%
  # arrange the data by code
  arrange(code) %>%
  # get the closest lower multiple of 10
  mutate(multiple10 = floor(code/10) * 10,
         # if completely divisible by 10 assign label, else NA
         result = ifelse(code %% 10 == 0, label, NA)) %>%
  # for each multiple of 10
  group_by(multiple10) %>%
  # fill NA with the most recent non-NA value in the group
  fill(result) %>%
  ungroup %>%
  # turn NA to blank, along with codes that are completely divisible by 10
  mutate(result = replace(result, code == multiple10 | is.na(result), ''))
# code label multiple10 result
# <dbl> <chr> <dbl> <chr>
#1 10 a 10 ""
#2 20 b 20 ""
#3 21 c 20 "b"
#4 22 d 20 "b"
#5 23 e 20 "b"
#6 31 f 30 ""
#7 32 g 30 ""
#8 40 h 40 ""
#9 50 i 50 ""
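Since the question asks for a vectorised base R approach, here is a minimal sketch built on match(). It is an alternative to the dplyr pipeline above rather than part of it, and it assumes that each multiple of 10 occurs at most once in code. The names base10, parent, prefix and combined are made up for illustration.

```r
df <- data.frame(
  code  = c(10, 20, 21, 22, 23, 31, 32, 40, 50),
  label = c("a", "b", "c", "d", "e", "f", "g", "h", "i"),
  stringsAsFactors = FALSE
)

# Multiple of 10 at or below each code.
base10 <- floor(df$code / 10) * 10

# Label of that multiple of 10, or NA when it is not listed in the data.
parent <- df$label[match(base10, df$code)]

# Blank for multiples of 10 and for codes whose multiple of 10 is missing.
prefix <- ifelse(df$code %% 10 != 0 & !is.na(parent), parent, "")

# Final output from the EDIT: prepend the prefix to the label.
combined <- trimws(paste(prefix, df$label))
```

Because match() looks values up by content rather than by position, this sketch does not rely on the rows being sorted by code.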

Pivot_wider introduces NA's

I am doing data management for a project and I am running into difficulties with what I thought would be a basic reshape from long format to wide.
The Data looks something like this:
df <- structure(list(ID = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
Time = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 1, 1, 1, 1, 2, 2),
Type = c("A", "B", "C", "D", "A", "B","C", "D", "A", "A", "B", "C", "D", "A", "B"),
Value = c(100, NA, 40, 123, 95, NA, 45, 1234, 100, 70, NA, 50, 12345, 75, NA)),
row.names = c(NA, 15L), class = "data.frame")
Based on previous Stack Overflow answers I am trying to use pivot_wider like this:
df.wide <- df %>%
group_by(ID, Type) %>%
mutate(row = row_number()) %>%
pivot_wider(names_from = Type, values_from = Value)
However this returns a dataframe with NA values at max(Time) for each ID that looks like this:
# A tibble: 5 x 7
ID Time row A B C D
<dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 100 NA 40 123
2 1 2 2 95 NA 45 1234
3 1 3 3 100 NA NA NA
4 2 1 1 70 NA 50 12345
5 2 2 2 75 NA NA NA
What am I doing wrong? My Google and Stack Overflow-fu have not been able to help me.
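One way to check where the NA values come from is to count how many rows the long data holds for each ID/Time pair before pivoting. This is only a diagnostic sketch, not an answer from the thread; it rebuilds the example data with data.frame(), and the names rows_per_cell and df_wide are made up for illustration.

```r
library(tidyr)

df <- data.frame(
  ID    = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  Time  = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 1, 1, 1, 1, 2, 2),
  Type  = c("A", "B", "C", "D", "A", "B", "C", "D", "A",
            "A", "B", "C", "D", "A", "B"),
  Value = c(100, NA, 40, 123, 95, NA, 45, 1234, 100,
            70, NA, 50, 12345, 75, NA),
  stringsAsFactors = FALSE
)

# Number of long-format rows available for each ID/Time combination.
rows_per_cell <- xtabs(~ ID + Time, data = df)
rows_per_cell

# Pivot without the helper row column; ID and Time identify each wide row.
df_wide <- pivot_wider(df, names_from = Type, values_from = Value)
df_wide
```

In this example, ID 1 has only Type A at Time 3 and ID 2 has only Types A and B at Time 2, so the C and D cells in those rows have nothing to fill them. The B column is NA throughout because every B value in the input is already NA.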

How can I fill columns based on values in another column? [duplicate]

I have a large dataframe containing a cross table of keys from other tables. Instead of having multiple instances of key1 coupled with different values for key2 I would like there to be one row for each key1 with several columns instead.
I tried doing this with a for loop but I couldn't get it to work.
Here's an example. I have a data frame with the structure df1 and I would like it to have the structure of df2.
df1 <- data.frame(c("a", "a", "a", "b", "b", "c", "c", "c", "c", "c", "d"),c(1, 2, 3, 2, 3, 1, 2, 3, 4, 5, 9))
names(df1) <- c("key1", "key2")
df2 <- data.frame(c("a", "b", "c", "d"), c(1, 2, 1, 9), c(2, 3, 2, NA), c(3, NA, 3, NA), c(NA, NA, 4, NA), c(NA, NA, 5, NA))
names(df2) <- c("key1", "key2_1", "key2_2", "key2_3", "key2_4", "key2_5")
I suspect this is possible using an approach utilizing apply but I haven't found a way yet. Any help is appreciated!
library(dplyr)
library(tidyr)
df1 %>%
  group_by(key1) %>%
  mutate(var = paste0("key2_", seq(n()))) %>%
  spread(var, key2)
# # A tibble: 4 x 6
# # Groups: key1 [4]
# key1 key2_1 key2_2 key2_3 key2_4 key2_5
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 a 1 2 3 NA NA
# 2 b 2 3 NA NA NA
# 3 c 1 2 3 4 5
# 4 d 9 NA NA NA NA
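The same reshape can also be written with pivot_wider(), which tidyr recommends over spread() in current versions. This is a sketch of an equivalent pipeline under that substitution, not the answer's own code; row_number() stands in for seq(n()).

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(
  key1 = c("a", "a", "a", "b", "b", "c", "c", "c", "c", "c", "d"),
  key2 = c(1, 2, 3, 2, 3, 1, 2, 3, 4, 5, 9),
  stringsAsFactors = FALSE
)

df2 <- df1 %>%
  group_by(key1) %>%
  # Number the key2 values within each key1 to build the new column names.
  mutate(var = paste0("key2_", row_number())) %>%
  ungroup() %>%
  pivot_wider(names_from = var, values_from = key2)

df2
```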

Dynamic Programming Knapsack select only Unique Items in R

I have this code that selects multiple items from a dataframe to put in a knapsack. I want it to select each item from the dataframe at most once:
knapsack_volume <- function(Data, W, Volume, full_K) {
  Data = Data
  # Data must have the columns with names: item, value, weight and volume.
  K <- list()      # highest values
  K_item <- list() # items that reach the highest value
  K <- rep(0, W + 1)       # The position '0'
  K_item <- rep('', W + 1) # The position '0'
  # while(length(Data$item) != 1){
  for (w in 1:W) {
    temp_w <- 0
    temp_item <- ''
    temp_value <- 0
    for (i in 1:dim(Data)[1]) { # each row
      wi <- Data$weight[i] # item i
      vi <- Data$value[i]
      item <- Data$item[i]
      volume_i <- Data$volume[i]
      if (wi <= w & volume_i <= Volume) {
        back <- full_K[[Volume - volume_i + 1]][w - wi + 1]
        temp_wi <- vi + back
        if (temp_w < temp_wi) {
          temp_value <- temp_wi
          temp_w <- temp_wi
          temp_item <- item
        }
      }
      # Data = Data[-i, ]
    }
    K[[w + 1]] <- temp_value
    K_item[[w + 1]] <- temp_item
  }
  return(list(K = K, Item = K_item))
}
The DataFrame looks like:-
item value weight volume
A 40 4 8
B 80 8 12
C 20 4 6
D 100 10 14
E 65 8 8
F 60 10 5
G 70 5 12
H 45 5 7
I 60 6 6
J 60 4 8
You may reproduce the dataframe with:-
Data = data.frame(item = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J"),
value = c(40, 80, 20, 100, 65, 60, 70, 45, 60, 60), weight = c(4, 8, 4, 10, 8,
10, 5, 5, 6, 4), volume = c(8, 12, 6, 14, 8, 5, 12, 7, 6, 8))
Thanks
How about deleting the item from the dataframe once you put it in your knapsack? However, you would need a guarantee that your knapsack can be completely filled with unique items.
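A different route from deleting rows is the textbook 0/1 knapsack recurrence, which processes one item at a time and walks the capacities downward so that no item can be counted twice. The sketch below swaps in that standard technique in place of the question's full_K structure. It assumes positive integer weights and volumes, returns only the best total value (not the chosen items), and the function name knapsack_01 and its arguments are made up for illustration.

```r
# 0/1 knapsack with two capacity limits (weight W and volume V).
# best[w + 1, v + 1] holds the best value reachable with weight <= w
# and volume <= v using the items processed so far.
knapsack_01 <- function(value, weight, volume, W, V) {
  best <- matrix(0, nrow = W + 1, ncol = V + 1)
  for (i in seq_along(value)) {
    wi <- weight[i]
    vi <- volume[i]
    if (wi > W || vi > V) next  # item can never fit
    # Walk capacities downward so the cells read below still describe
    # the state before item i was considered (each item used at most once).
    for (w in W:wi) {
      for (v in V:vi) {
        candidate <- best[w - wi + 1, v - vi + 1] + value[i]
        if (candidate > best[w + 1, v + 1]) {
          best[w + 1, v + 1] <- candidate
        }
      }
    }
  }
  best[W + 1, V + 1]
}

# Small usage example with the first three items (A, B, C) of the question's data.
knapsack_01(value = c(40, 80, 20), weight = c(4, 8, 4),
            volume = c(8, 12, 6), W = 8, V = 14)
```

Recovering which items were chosen would need extra bookkeeping, for example a parallel table that records the item behind each improvement.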
