Separating multi-valued attributes into individual attributes R


I'm working with the Stack Overflow Developer Survey dataset and trying to predict compensation from the technologies and the collaboration tools a respondent has worked with. Both attributes are multi-valued, with semicolons separating the individual values.
For instance, the CollabToolsWorkedWith attribute in one row contains Confluence;Jira;Github;Slack;Microsoft;Teams;Google Suite. I want each of these values to get its own column, holding 1 if the row had that value and 0 otherwise.
The end result would have one column for every distinct value under CollabToolsWorkedWith, each filled with 0s and 1s according to whether the row contained that value.

You may get a quicker answer next time if you provide some sample data that everyone can quickly access. I found the 2020 data online. Here is my answer:
# read the data frame
rm(list = ls())
df <- read.csv("survey_results_public.csv")
# figure out which column you are talking about
data.frame(colnames(df))
table(df$NEWCollabToolsWorkedWith)
# convert to lower case and character
df$NEWCollabToolsWorkedWith <- as.character(df$NEWCollabToolsWorkedWith)
df$NEWCollabToolsWorkedWith <- tolower(df$NEWCollabToolsWorkedWith)
# keep only the useful variables and separate based on ;
library(tidyverse)
library(splitstackshape)
namesdf <- df %>% select(NEWCollabToolsWorkedWith)
namesdf <- cSplit(namesdf,"NEWCollabToolsWorkedWith", sep = ";", direction = "wide", drop=TRUE,
type.convert = TRUE)
# stack stuff on top of each other to find unique list of tools/platforms
long_data_frame <-
namesdf %>%
pivot_longer(cols = starts_with("NEWCollabToolsWorkedWith"), # use the columns created by cSplit
names_to ="unique", # name of new column
names_prefix = "_",
values_drop_na = TRUE) %>%
distinct(value)
# clean the variable names
library(janitor)
long_data_frame$value = as.character(long_data_frame$value)
long_data_frame$value = janitor::make_clean_names(long_data_frame$value)
# get final unique list
table(long_data_frame$value)
# confluence                  facebook_workplace        github           gitlab
# 1                           1                         1                1
# google_suite_docs_meet_etc  jira                      microsoft_azure  microsoft_teams
# 1                           1                         1                1
# slack                       stack_overflow_for_teams  trello
# 1                           1                         1
# create new variables
df$confluence <- NA
df$jira <- NA
df$slack = NA
df$microsoft_azure =NA
df$trello = NA
df$github = NA
df$gitlab = NA
df$google_suite_docs_meet_etc = NA
df$microsoft_teams = NA
df$stack_overflow_for_teams = NA
df$facebook_workplace =NA
# make a dummy variable based on string match
df$confluence <- as.integer(grepl(pattern = "confluence", x = df$NEWCollabToolsWorkedWith))
df$jira <- as.integer(grepl(pattern = "jira", x = df$NEWCollabToolsWorkedWith))
df$slack <- as.integer(grepl(pattern = "slack", x = df$NEWCollabToolsWorkedWith))
df$microsoft_azure <- as.integer(grepl(pattern = "microsoft azure", x = df$NEWCollabToolsWorkedWith))
df$trello <- as.integer(grepl(pattern = "trello", x = df$NEWCollabToolsWorkedWith))
df$github <- as.integer(grepl(pattern = "github", x = df$NEWCollabToolsWorkedWith))
df$gitlab <- as.integer(grepl(pattern = "gitlab", x = df$NEWCollabToolsWorkedWith))
df$google_suite_docs_meet_etc <- as.integer(grepl(pattern = "google", x = df$NEWCollabToolsWorkedWith))
df$microsoft_teams <- as.integer(grepl(pattern = "microsoft teams", x = df$NEWCollabToolsWorkedWith))
df$stack_overflow_for_teams <- as.integer(grepl(pattern = "overflow", x = df$NEWCollabToolsWorkedWith))
df$facebook_workplace <- as.integer(grepl(pattern = "facebook", x = df$NEWCollabToolsWorkedWith))
# proof that it went through
table(df$facebook_workplace)
#     0     1
# 62881  1580
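The column-by-column grepl() calls above can also be generated for every distinct value in one pass. The following is a minimal base R sketch under stated assumptions: the toy data frame and the column name tools are hypothetical stand-ins for the survey data, and each value is matched exactly after splitting rather than by substring.

```r
# Hypothetical toy data standing in for the survey column
df <- data.frame(
  id = 1:3,
  tools = c("Confluence;Jira;Slack", "Github;Slack", NA),
  stringsAsFactors = FALSE
)

# Split each row on ";" into a list of character vectors
parts <- strsplit(df$tools, ";", fixed = TRUE)

# Distinct non-missing values across all rows
values <- sort(unique(na.omit(unlist(parts))))

# One 0/1 column per value: 1 if the row's split values contain it
dummies <- vapply(values, function(v) {
  as.integer(vapply(parts, function(p) v %in% p, logical(1)))
}, integer(nrow(df)))

result <- cbind(df, dummies)
result
```

Rows with a missing value get 0 in every indicator column. Because the comparison is on whole split values, a tool whose name is a substring of another tool's name is not counted by accident.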

Related

Read multiple files into DuckDB in R from CSV, with new variable indicating year from filename

I have several (8) large files (1M rows each) with the same variables/format saved individually by year.
I would like to save to a single table using the duckdb database format in R.
The duckdb_read_csv() command does this nicely.
The problem: there is no variable indicating "year" using this method, so the trend for repeated measurements is lost. My DBI and SQL knowledge is limited, so there's probably an easy way to do this, but I'm stuck here (simple example demonstrating issue below):
library(duckdb)
con <- dbConnect(duckdb())
# example year 1 data
data <- data.frame(id = 1:3, b = letters[1:3]) # year = 1
path <- tempfile(fileext = ".csv")
write.csv(data, path, row.names = FALSE)
# example year 2 data
data2 <- data.frame(id = 1:3, b = letters[4:6]) # year = 2
path2 <- tempfile(fileext = ".csv")
write.csv(data2, path2, row.names = FALSE)
duckdb_read_csv(con, "data", files = c(path, path2)) # data is appended
which yields the following. The data is appended by row, but I need a variable to indicate "year":
dbReadTable(con, "data")
id b
1 1 a
2 2 b
3 3 c
4 1 d
5 2 e
6 3 f
Is there a way to create a new variable during this process, or is it better to create a table for

I modified Jon's answer to solve my issue (thanks Jon).
Files were too large to read into memory all at once (~20M rows each), so I iterated over them one at a time to transfer into the database (~198M rows total).
I also extracted the "year" from the filename using gsub into a separate variable yrs.
library(dplyr)
library(duckdb)
con <- dbConnect(duckdb())
# example year 1 data
data <- data.frame(id = 1:3, b = letters[1:3]) # year = 1
path1 <- tempfile(fileext = ".csv")
write.csv(data, path1, row.names = FALSE)
# example year 2 data
data2 <- data.frame(id = 1:3, b = letters[4:6]) # year = 2
path2 <- tempfile(fileext = ".csv")
write.csv(data2, path2, row.names = FALSE)
paths <- c(path1, path2)
yrs <- c("year1", "year2")
purrr::walk2(paths, yrs,
~{
readr::read_csv(.x, n_max = Inf) %>% # read 1 file into memory at a time
mutate(year = .y) %>%
rename_all(. %>% tolower()) %>% # i really prefer lower case variable names
duckdb::dbWriteTable(conn = con, name = "data", value = ., append = TRUE)
})
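The post mentions pulling the year out of the file name with gsub but does not show the call. A minimal sketch of that step follows, assuming file names that end in an underscore and a four-digit year; the paths here are hypothetical, since the real naming scheme is not given.

```r
# Hypothetical file paths; the actual naming scheme is not shown in the post
paths <- c("/data/survey_2019.csv", "/data/survey_2020.csv")

# Keep only the four-digit year that precedes ".csv" in the base file name
yrs <- gsub("^.*_([0-9]{4})\\.csv$", "\\1", basename(paths))
yrs
# "2019" "2020"
```

The resulting yrs vector can then be passed alongside paths to purrr::walk2() as in the code above.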

One approach could be to use purrr::map_dfr + readr::read_csv for the reading, which allows you to assign an "id" column based on names assigned to the file paths, and then register that as a duckdb table:
library(dplyr)
purrr::map_dfr(c(year01 = path,
year02 = path2),
~readr::read_csv(.x),
.id = "year") %>%
duckdb_register(con, "data", .)
Result
tbl(con, "data") %>%
collect()
# A tibble: 6 × 3
year id b
<chr> <dbl> <chr>
1 year01 1 a
2 year01 2 b
3 year01 3 c
4 year02 1 d
5 year02 2 e
6 year02 3 f
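For comparison, the same idea can be sketched in base R without purrr or readr: read each file, tag it with the name attached to its path, and stack the pieces. This is only an illustration on the small example files; like map_dfr, it holds all the data in memory before anything is written to the database.

```r
# Recreate the two small example files
data1 <- data.frame(id = 1:3, b = letters[1:3])
data2 <- data.frame(id = 1:3, b = letters[4:6])
path1 <- tempfile(fileext = ".csv")
path2 <- tempfile(fileext = ".csv")
write.csv(data1, path1, row.names = FALSE)
write.csv(data2, path2, row.names = FALSE)

# Name each path with the label that should end up in the "year" column
paths <- c(year01 = path1, year02 = path2)

# Read each file, add its label, then stack the results
combined <- do.call(rbind, lapply(names(paths), function(nm) {
  d <- read.csv(paths[[nm]], stringsAsFactors = FALSE)
  d$year <- nm
  d
}))
combined
```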

Network graph R - joining

Looking for some help with joining in order to make a forceNetwork() graph using networkD3. I can't figure out what's wrong with the code below; I'm getting the following warning messages.
I used this code before and I got it to work back then - just not sure what's different this time as I feel the input file is the same.
Warning messages:
1: Column `src`/`name` joining factors with different levels, coercing to character vector
2: Column `target`/`name` joining factors with different levels, coercing to character vector
# Load package
library(networkD3)
library(dplyr)
# Create data
src <- c(all_artists$from)
target <- c(all_artists$to)
networkData <- data.frame(src, target, stringsAsFactors = TRUE)
networkData
nodes <- data.frame(name = unique(c(src, target)), size = all_artists$related_artist_followers, stringsAsFactors = TRUE)
nodes$id <- 0:(nrow(nodes) - 1)
nodes
width <- c(all_artists$related_artist_followers)
width
# create a data frame of the edges that uses id 0:9 instead of their names
edges <- networkData %>%
left_join(nodes, by = c("src" = "name")) %>%
select(-src) %>%
rename(source = id) %>%
left_join(nodes, by = c("target" = "name")) %>%
select(-target) %>%
rename(target = id)
The dataset shows the artists that are related to each other - from is the nodes and to is the edges.
from to artist_popularity
Jay-Z Kanye West 80
Jay-Z P. Diddy 60
Kanye West Kid Cudi 40

The line where you build the nodes data frame seems unlikely to work as expected because there's no connection between the length of unique(c(src, target)) and all_artists$related_artist_followers. You could count the number of times a node/name appears in the networkData$src or all_artists$from column with...
nodes$size <- sapply(nodes$name, function(name) sum(networkData$src %in% name))
Once you have the nodes data frame created, it's easy to convert the names in the networkData data frame to zero-indexed indices with...
networkData$src <- match(networkData$src, nodes$name) - 1
networkData$target <- match(networkData$target, nodes$name) - 1
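To make the indexing step concrete, here is a small standalone sketch of what match() returns for the three example edges. The vector names node_names, src_idx and target_idx are chosen for illustration only.

```r
# Node names in the order they will appear in the nodes data frame
node_names <- c("Jay-Z", "Kanye West", "P. Diddy", "Kid Cudi")

src    <- c("Jay-Z", "Jay-Z", "Kanye West")
target <- c("Kanye West", "P. Diddy", "Kid Cudi")

# match() gives 1-based positions; subtracting 1 yields zero-based indices
src_idx    <- match(src, node_names) - 1
target_idx <- match(target, node_names) - 1

src_idx     # 0 0 1
target_idx  # 1 2 3
```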
Note that it is also mandatory to provide a Value parameter for the Links data frame and a Group parameter for the Nodes data frame. Any parameter that does not have a default value in the help file is mandatory; otherwise you might get an error or unexpected behavior. That goes for all R functions, not just networkD3. You can create columns in your data frames for them like this...
networkData$value <- 1
nodes$group <- 1
So all together in a reproducible example, you might have...
from <- c("Jay-Z", "Jay-Z", "Kanye West")
to <- c("Kanye West", "P. Diddy", "Kid Cudi")
artist_popularity <- c(80, 60, 40)
all_artists <- data.frame(from, to, artist_popularity, stringsAsFactors = FALSE)
networkData <- data.frame(src = all_artists$from, target = all_artists$to,
stringsAsFactors = FALSE)
nodes <- data.frame(name = unique(c(networkData$src, networkData$target)),
stringsAsFactors = FALSE)
nodes$size <- sapply(nodes$name, function(name) sum(networkData$src %in% name))
networkData$src <- match(networkData$src, nodes$name) - 1
networkData$target <- match(networkData$target, nodes$name) - 1
networkData$value <- 1
nodes$group <- 1
library(networkD3)
forceNetwork(Links = networkData, Nodes = nodes, Source = "src",
Target = "target", Value = "value", NodeID = "name",
Nodesize = "size", Group = "group", opacityNoHover = 1)

If else ladder not working in R

I have the data frame below after reading and rearranging multiple csv files. I want an if/else ladder that looks at the ID column and, if the ID matches a number from one of several vectors, places the corresponding word in a new "group" column.
# of int. int. not.int. ID
1 50 218.41 372.16 1
3 33 134.94 158.17 3
I then made these vectors (built with c()) to refer to.
veh = as.character(c('1', '5'))
thc1 = as.character(c('2', '6'))
thc2 = as.character(c('3', '7'))
thc3 = as.character(c('4', '8'))
Then I made an if else ladder to list the corresponding values.
social.dat$group = if (social.dat$ID == veh) {
social.dat$group == "veh"
} else if (social.dat$group == thc1) {
social.dat$group == "thc1"
} else if (social.dat$group == thc2) {
social.dat$group == "thc2"
} else {
social.dat$group == "thc3"
}
However, I then get this warning message.
Warning message:
In if (social.dat$ID == veh) { :
the condition has length > 1 and only the first element will be used
I have looked up this warning message in multiple variations and have not found anything that really helped. Any help would be much appreciated, and alternative options would be good as well. I apologize in advance if there is already a solution on Stack Overflow that I missed.
EDIT:
I tried using
social.dat$group = ifelse(social.dat$ID == veh, "veh", "thc")
social.dat$group = ifelse(social.dat$ID == thc, "thc", "veh")
but it changed the output of the dataframe after each line.
Here is the full code I am using to rearrange the csv files and get the data frame that I first mentioned above.
#calls packages
library(tidyr)
library( plyr )
library(reshape2)
#make sure to change your working directory to where the files are kept
setwd("C:/Users/callej03/Desktop/test")
wd = "C:/Users/callej03/Desktop/test"
files = list.files(path=wd, pattern="*.csv", full.names=TRUE,
recursive=FALSE)
################################################################
#this function creates a list of the number of interactions for each file in the folder
lap.list = lapply(files, function(x) {
dat = read.csv(x, header= TRUE)
dat = dat[-c(1),]
dat = as.data.frame(dat)
dat = separate(data = dat, col = dat, into = c("lap", "duration"), sep = "\\ ")
dat$count = 1:nrow(dat)
y = dat$count
i= y%%2==0
dat$interacting = i
int = dat[which(dat$interacting == TRUE),]
interactions = sum(int$interacting)
})
#########################################################################
#this changes the row name to the name of the file - i.e. the rat ID
lap.list = as.data.frame(lap.list)
lap.list = t(lap.list)
colnames(lap.list) = c("# of int.")
row.names(lap.list) = sub(wd, "", files)
row.names(lap.list) = gsub("([0-9]+).*$", "\\1", rownames(lap.list))
row.names(lap.list) = gsub('/', "", row.names(lap.list), fixed = TRUE)
###########################################################################
#this applies almost the same function as the one listed above except I call it a different vector name so it can be manipulated
int.duration = lapply(files, function(x) {
dat2 = read.csv(x, header= TRUE)
dat2 = dat2[-c(1),]
dat2 = as.data.frame(dat2)
dat2 = separate(data = dat2, col = dat2, into = c("lap", "duration"), sep = "\\ ")
dat2$count = 1:nrow(dat2)
y = dat2$count
i= y%%2==0
dat2$interacting = i
int = dat2[which(dat2$interacting == TRUE),]
})
noint.duration = lapply(files, function(x) {
dat2 = read.csv(x, header= TRUE)
dat2 = dat2[-c(1),]
dat2 = as.data.frame(dat2)
dat2 = separate(data = dat2, col = dat2, into = c("lap", "duration"), sep = "\\ ")
dat2$count = 1:nrow(dat2)
y = dat2$count
i= y%%2==0
dat2$interacting = i
noint = dat2[which(dat2$interacting == FALSE),]
})
###################################################################
#this splits the output time of minutes, seconds, and milliseconds.
#then it combines them into a total seconds.milliseconds readout.
#after that, it takes the sum of the times for each file and combines them with the total interactions dataframe.
int.duration = melt(int.duration)
int.duration = as.data.frame(int.duration)
int.left = as.data.frame(substr(int.duration$duration, 1, 2))
colnames(int.left) = "min"
int.mid = as.data.frame(substr(int.duration$duration, 4, 4 + 2 - 1))
colnames(int.mid) = "sec"
int.right = as.data.frame(substr(int.duration$duration,
nchar(int.duration$duration) - (2-1), nchar(int.duration$duration)))
colnames(int.right) = "ms"
int.time = cbind(int.left, int.mid, int.right)
int.time$min = as.numeric(as.character(int.time$min))
int.time$sec = as.numeric(as.character(int.time$sec))
int.time$ms = as.numeric(as.character(int.time$ms))
int.time$ms = int.time$ms/100
int.time$min = ifelse(int.time$min > 0, int.time$min*60,0)
int.time$sum = rowSums(int.time)
int.file = as.data.frame(int.duration$L1)
colnames(int.file) = "file"
int.time = cbind(int.time, int.file)
int.tot = as.data.frame(tapply(int.time$sum, int.time$file, sum))
colnames(int.tot) = "int."
social.dat = cbind(lap.list, int.tot)
noint.duration = melt(noint.duration)
noint.duration = as.data.frame(noint.duration)
noint.left = as.data.frame(substr(noint.duration$duration, 1, 2))
colnames(noint.left) = "min"
noint.mid = as.data.frame(substr(noint.duration$duration, 4, 4 + 2 - 1))
colnames(noint.mid) = "sec"
noint.right = as.data.frame(substr(noint.duration$duration,
nchar(noint.duration$duration) - (2-1), nchar(noint.duration$duration)))
colnames(noint.right) = "ms"
noint.time = cbind(noint.left, noint.mid, noint.right)
noint.time$min = as.numeric(as.character(noint.time$min))
noint.time$sec = as.numeric(as.character(noint.time$sec))
noint.time$ms = as.numeric(as.character(noint.time$ms))
noint.time$ms = noint.time$ms/100
noint.time$min = ifelse(noint.time$min > 0, noint.time$min*60,0)
noint.time$sum = rowSums(noint.time)
noint.file = as.data.frame(noint.duration$L1)
colnames(noint.file) = "file"
noint.time = cbind(noint.time, noint.file)
noint.tot = as.data.frame(tapply(noint.time$sum, noint.time$file, sum))
colnames(noint.tot) = "not.int."
social.dat = cbind(social.dat, noint.tot)
social.dat$ID = rownames(social.dat)
Here is an example of a csv file I am working with. The words are all in the same column and separated by spaces.
Total time 10:00.61
Lap times
01 00:07.46
02 00:05.64
03 00:01.07
04 00:01.04
05 00:04.71
06 00:06.43
07 00:12.52
08 00:07.34
09 00:05.46
10 00:05.81
11 00:05.52
12 00:06.51
13 00:10.75
14 00:00.83
15 00:03.64
16 00:02.75
17 00:01.20
18 00:06.17
19 00:04.40
20 00:00.75
21 00:00.84
22 00:01.29
23 00:02.31
24 00:03.04
25 00:02.85
26 00:05.86
27 00:05.76
28 00:05.06
29 00:00.96
30 00:06.91

@akrun suggested ifelse, which works great for one or two nestings. Much past that, my personal preference is to use dplyr::case_when or a separate data.frame in a merge/join of sorts.
If you are using the "simple case" of assigning consistently by the same fields (id in this case), then the merge/join is my preferred method: it makes maintenance much simpler (IMO). (When I say "consistently by the same fields", I mean that you could have id1 and id2 fields by which you define the individual records and their applicable groups. That is likely too much for your example, so I'll keep this answer to merging on one key.)
Three methods (data far below):
Base R
dat2a <- merge(dat, groups, by="id", all.x=TRUE)
dat2a
# id int group
# 1 1 22 veh
# 2 2 33 thc1
# 3 3 44 <NA>
Note that any id not included in the definition of groups will have NA group. You can assign a default group with this:
dat2a$group[is.na(dat2a$group)] <- "somedefaultgroup"
dat2a
# id int group
# 1 1 22 veh
# 2 2 33 thc1
# 3 3 44 somedefaultgroup
dplyr, merge/join
Similar concept, but using dplyr-esque verbs.
library(dplyr)
dat2c <- left_join(dat, groups, by="id") %>%
mutate(group = if_else(is.na(group), "somedefaultgroup", group))
dplyr::case_when
(This does not use groups as I defined for the merge/join cases.)
In case you really want to do some ladder/nesting of if/else-like statements, case_when is easier to read (and debug) and might be faster, depending on your use-case.
Most direct:
library(dplyr)
dat2b <- dat
dat2b$group <- case_when(
dat2b$id %in% c("1","5") ~ "veh",
dat2b$id %in% c("2","6") ~ "thc1",
TRUE ~ "somedefaultgroup"
)
A little easier to read than the previous by using with(...), but functionally identical. (If your "ladder" is much longer, then code-golf (number of characters in the code) can be significantly reduced.)
dat2b <- dat
dat2b$group <- with(dat2b, case_when(
id %in% c("1","5") ~ "veh",
id %in% c("2","6") ~ "thc1",
TRUE ~ "somedefaultgroup"
))
If you want to use some dplyr verbs, then:
dat2b <- dat
dat2b <- dat2b %>%
mutate(
group = case_when(
id %in% c("1","5") ~ "veh",
id %in% c("2","6") ~ "thc1",
TRUE ~ "somedefaultgroup"
)
)
Data
When doing merge/join actions, it's important to use stringsAsFactors=FALSE so that the absence of factor levels (of the newly-assigned groups) is not a problem. (This can be worked around, but ...)
dat <- data.frame(id=c("1","2","3"), int=c(22L,33L,44L),
stringsAsFactors=FALSE)
Optional use for the merge examples above:
groups <- data.frame(id=c("1","5","2","6"), group=c("veh","veh","thc1","thc1"),
stringsAsFactors=FALSE)
groups
# id group
# 1 1 veh
# 2 5 veh
# 3 2 thc1
# 4 6 thc1
The premise is that you define one row for each unique id.
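The same one-row-per-id premise can also be expressed without a join, as a named character vector used as a lookup table. This is a hedged base R sketch rather than part of the methods above; the vector name lookup and the extra ID "9" are illustrative assumptions.

```r
# Named vector: names are the IDs, values are the groups
lookup <- c("1" = "veh",  "5" = "veh",
            "2" = "thc1", "6" = "thc1",
            "3" = "thc2", "7" = "thc2",
            "4" = "thc3", "8" = "thc3")

# Toy data with character IDs; "9" has no entry in the lookup
social.dat <- data.frame(ID = c("1", "3", "9"), stringsAsFactors = FALSE)

# Index the lookup by ID; unmatched IDs come back as NA
social.dat$group <- unname(lookup[social.dat$ID])
social.dat
```

The IDs must be character for this to index by name; numeric IDs would index by position instead.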

Thanks to @r2evans, the following code worked exactly as I wanted it to (using dplyr::case_when):
social.dat$group = case_when(
social.dat$ID %in% c("1","5") ~ "veh",
social.dat$ID %in% c("2","6") ~ "thc1",
social.dat$ID %in% c("3","7") ~ "thc2",
social.dat$ID %in% c("4","8") ~ "thc3"
)
This was the final output of the dataframe
# of int. int. not.int. ID group
1 50 218.41 372.16 1 veh
3 33 134.94 158.17 3 thc2
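As a side note on why the original comparison misbehaved: == compares element by element and recycles the shorter vector, whereas %in% tests membership. A small sketch with made-up IDs:

```r
ID  <- c("1", "5", "5", "1")
veh <- c("1", "5")

# == recycles veh to "1","5","1","5" and compares position by position
eq_result <- ID == veh
eq_result   # TRUE TRUE FALSE FALSE

# %in% asks, for each ID, whether it appears anywhere in veh
in_result <- ID %in% veh
in_result   # TRUE TRUE TRUE TRUE
```

Either result has length greater than one, which is why a plain if() cannot take it as a condition; vectorised tools such as ifelse() or case_when() are needed.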

How to RBind First 4 Column one above Other with Tag

Below I have tried to reproduce my data in a representative form:
v<- data.frame(C1TEMP = c(3,6,1,8,9,2,2,9,1,23),
C1VIB = c(5,6,1,8,9,2,2,9,1,23),
C1DE = c(9,6,1,8,9,2,2,9,1,23),
C1NDE = c(8,6,1,8,9,2,2,9,1,23),
C2TEMP = c(5,6,1,8,9,2,2,9,1,23),
C2VIB = c(378,6,1,8,9,2,2,9,1,23),
C2DE = c(3,78,1,8,9,2,2,9,1,23),
C2NDE = c(3,6,1,8,9,2,2,9,1,23),
C3TEMP= c(3,6,89,8,9,2,2,9,1,23),
C3VIB = c(3,6,1,98,9,2,2,9,1,23),
C3DE = c(33,56,91,82,99,12,22,19,81,23),
C3NDE = c(13,76,91,88,59,42,22,39,21,23))
Here I want to rbind every block of 4 columns, one above the other, along with a tag giving the block number. The number of columns will always be divisible by 4. I am also attaching an image to give a clear picture of the expected result.
EXPECTED OUTPUT:

I agree with YCR's comment. Still, this is a way to tackle your problem. Use the following code:
# data frames need column headers, so convert to matrix
v01 <- as.matrix(v[, 1:4])
v02 <- as.matrix(v[, 5:8])
v03 <- as.matrix(v[, 9:12])
# remove columnnames
colnames(v01) <- NULL
colnames(v02) <- NULL
colnames(v03) <- NULL
# now you can use rbind and give the columnnames back
v2 <- rbind( v01, v02, v03)
colnames(v2) <- c("C1TEMP", "C1VIB", "C1DE", "C1NDE")
v2
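The manual slicing above can be generalised to any number of 4-column blocks and extended with the requested tag column. The following is a sketch under the assumption that columns are ordered block by block and named with a C<number> prefix; the small data frame is a cut-down stand-in for v, and block_size, block_id and stacked are illustrative names.

```r
# Cut-down stand-in for v: two blocks of four columns
v <- data.frame(C1TEMP = 1:2, C1VIB = 3:4, C1DE = 5:6, C1NDE = 7:8,
                C2TEMP = 9:10, C2VIB = 11:12, C2DE = 13:14, C2NDE = 15:16)

block_size <- 4

# Block number for each column: 1,1,1,1,2,2,2,2,...
block_id <- (seq_along(v) - 1) %/% block_size + 1

# split.default() splits a data frame by columns into a list of data frames
blocks <- split.default(v, block_id)

# Strip the "C<number>" prefix so names agree, add the tag, then stack
stacked <- do.call(rbind, lapply(names(blocks), function(k) {
  b <- blocks[[k]]
  names(b) <- sub("^C[0-9]+", "", names(b))
  cbind(Tag = as.integer(k), b)
}))
rownames(stacked) <- NULL
stacked
```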

Try this.
It is a bit more convoluted than the previous answers, but it should be more adaptable to other data frames.
# how many blocks have you got?
howMany <-table(gsub(names(v),pattern = "[0-9]",replacement = ""))[1]
# make a common name string
NAMES <- unique(gsub(names(v),pattern = "[0-9]",replacement = ""))
# create a list
list() -> V
for(i in 1:howMany){
# get the column with matching index number
v[,grep(names(v),pattern = i)] -> vi
names(vi) <- NAMES# change name
data.frame(Tag=i,vi) -> V[[i]]# put it in the list
}
# combine the data frames in the list into one data frame
do.call(rbind,V)
Nils

The melt and reshape way:
It requires an identifier per row:
v<- data.frame(C1TEMP = c(3,6,1,8,9,2,2,9,1,23),
C1VIB = c(5,6,1,8,9,2,2,9,1,23),
C1DE = c(9,6,1,8,9,2,2,9,1,23),
C1NDE = c(8,6,1,8,9,2,2,9,1,23),
C2TEMP = c(5,6,1,8,9,2,2,9,1,23),
C2VIB = c(378,6,1,8,9,2,2,9,1,23),
C2DE = c(3,78,1,8,9,2,2,9,1,23),
C2NDE = c(3,6,1,8,9,2,2,9,1,23),
C3TEMP= c(3,6,89,8,9,2,2,9,1,23),
C3VIB = c(3,6,1,98,9,2,2,9,1,23),
C3DE = c(33,56,91,82,99,12,22,19,81,23),
C3NDE = c(13,76,91,88,59,42,22,39,21,23),
id = 1:10
, stringsAsFactors = F)
library(tidyverse)
# melt the dataframe(reshape from wide to long format):
v_melt <- reshape2::melt(v, id.vars = "id")
# modify the aggregation variables
v_melt <- v_melt %>%
mutate(var = substr(as.character(variable), 3, 8),
group_id = paste0(substr(as.character(variable), 1, 2), "_", id))
# reshape the data frame in a wide format:
v_cast <- reshape2::dcast(v_melt, group_id ~ var, value.var = "value")

Merging Long-Form Data that has NAs with Wide-Form Complete Data To Override NAs

So I have three data sets that I need to merge. These contain school data and read/math scores for grades 4 and 5. One of them is a long form data set that has a lot of missingness in some variables (yes, I do need the data in long form) and the other two have the full missing data in wide form. All of these data frames contain a column that has an unique ID number for each individual in the database.
Here is a full reproducible example that generates a small example of the types of data.frames I am working with... The three data frames that I need to use are the following: school_lf, school4 and school5. school_lf has the long form data with NAs and school4 and school5 are the dfs I need to use to populate the NA's in this long form data (by id and grade)
library(reshape2) # provides melt(), used below
set.seed(890)
school <- NULL
school$id <-sample(102938:999999, 100)
school$selected <-sample(0:1, 100, replace = T)
school$math4 <- sample(400:500, 100)
school$math5 <- sample(400:500, 100)
school$read4 <- sample(400:500, 100)
school$read5 <- sample(400:500, 100)
school <- as.data.frame(school)
# Delete observations at random from the school df
indm4 <- which(school$math4 %in% sample(school$math4, 25))
school$math4[indm4] <- NA
indm5 <- which(school$math5 %in% sample(school$math5, 50))
school$math5[indm5] <- NA
indr4 <- which(school$read4 %in% sample(school$read4, 70))
school$read4[indr4] <- NA
indr5 <- which(school$read5 %in% sample(school$read5, 81))
school$read5[indr5] <- NA
# Separate Read and Math
read <- as.data.frame(subset(school, select = -c(math4, math5)))
math <- as.data.frame(subset(school, select = -c(read4, read5)))
# Now turn this into long form data...
clr <- melt(read, id.vars = c("id", "selected"), variable.name = "variable", value.name = "readscore")
clm <- melt(math, id.vars = c("id", "selected"), value.name = "mathscore")
# Clean up the grades for each of these...
clr$grade <- ifelse(clr$variable == "read4", 4,
ifelse(clr$variable == "read5", 5, NA))
clm$grade <- ifelse(clm$variable == "math4", 4,
ifelse(clm$variable == "math5", 5, NA))
# Put all these in one df
school_lf <-cbind(clm, clr$readscore)
school_lf$readscore <- school_lf$`clr$readscore` # renames
school_lf$`clr$readscore` <- NULL # deletes
school_lf$variable <- NULL # deletes
###############
# Generate the 2 data frames with IDs that have the full data
set.seed(890)
school4 <- NULL
school4$id <-sample(102938:999999, 100)
school4$selected <-sample(0:1, 100, replace = T)
school4$math4 <- sample(400:500, 100)
school4$read4 <- sample(400:500, 100)
school4$grade <- 4
school4 <- as.data.frame(school4)
set.seed(890)
school5 <- NULL
school5$id <-sample(102938:999999, 100)
school5$selected <-sample(0:1, 100, replace = T)
school5$math5 <- sample(400:500, 100)
school5$read5 <- sample(400:500, 100)
school5$grade <- 5
school5 <- as.data.frame(school5)
I need to merge the wide-form data into the long-form data to replace the NAs with the actual values. I have tried the code below, but it introduces several columns instead of merging the read scores and the math scores where there's NA's. I simply need one column with the read scores and one with the math scores, instead of six separate columns (read.x, read.y, math.x, math.y, mathscore and readscore).
sch <- merge(school_lf, school4, by = c("id", "grade", "selected"), all = T)
sch <- merge(sch, school5, by = c("id", "grade", "selected"), all = T)
Any help is highly appreciated! I've been trying to solve this for hours now and haven't made any progress (so figured I'd ask here)

You can use the coalesce function from dplyr. If a value in the first vector is NA, it will see if the value at the same position in the second vector is not NA and select it. If again NA, it goes to the third.
library(dplyr)
sch %>% mutate(mathscore = coalesce(mathscore, math4, math5)) %>%
mutate(readscore = coalesce(readscore, read4, read5)) %>%
select(id:readscore)
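To illustrate what coalesce does position by position, here is a rough base R equivalent on three short made-up vectors. It is only a sketch of the behaviour for the three-vector case, not a replacement for dplyr::coalesce.

```r
mathscore <- c(NA, 450, NA)
math4     <- c(410, NA, NA)
math5     <- c(NA, 470, 480)

# Take mathscore where present, else math4, else math5
filled <- ifelse(!is.na(mathscore), mathscore,
                 ifelse(!is.na(math4), math4, math5))
filled
# 410 450 480
```

Note that the vectors are combined by position, so all of them must describe the same rows in the same order.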

EDIT: I just tried to do this approach on my actual data and it does not work because the replacement data also has some NAs and, as a result, the dfs I try to do coalesce with have differing number of rows... Back to square one.
I was able to figure this out with the following code (albeit not the most elegant or straightforward), and @Edwin's response helped point me in the right direction. Any suggestions on how to make this code more elegant and efficient are more than welcome!
# Idea: put both in long form and stack on top of one another... then merge like that!
sch4r <- as.data.frame(subset(school4, select = -c(mathscore)))
sch4m <- as.data.frame(subset(school4, select = -c(readscore)))
sch5r <- as.data.frame(subset(school5, select = -c(mathscore)))
sch5m <- as.data.frame(subset(school5, select = -c(readscore)))
# Put these in LF
sch4r_lf <- melt(sch4r, id.vars = c("id", "selected", "grade"), value.name = "readscore")
sch4m_lf <- melt(sch4m, id.vars = c("id", "selected", "grade"), value.name = "mathscore")
sch5r_lf <- melt(sch5r, id.vars = c("id", "selected", "grade"), value.name = "readscore")
sch5m_lf <- melt(sch5m, id.vars = c("id", "selected", "grade"), value.name = "mathscore")
# Combine in one DF
sch_full_4 <-cbind(sch4r_lf, sch4m_lf$mathscore)
sch_full_4$mathscore <- sch_full_4$`sch4m_lf$mathscore`
sch_full_4$`sch4m_lf$mathscore` <- NULL # deletes
sch_full_4$variable <- NULL
sch_full_5 <- cbind(sch5r_lf, sch5m$mathscore)
sch_full_5$mathscore <- sch_full_5$`sch5m$mathscore`
sch_full_5$`sch5m$mathscore` <- NULL
sch_full_5$variable <- NULL
# Stack together
sch_full <- rbind(sch_full_4,sch_full_5)
sch_full$selected <- NULL # delete this column...
# MERGE together
final_school_math <- mutate(school_lf, mathscore = coalesce(school_lf$mathscore, sch_full$mathscore))
final_school_read <- mutate(school_lf, readscore = coalesce(school_lf$readscore, sch_full$readscore))
final_df <- cbind(final_school_math, final_school_read$readscore)
final_df$readscore <- final_df$`final_school_read$readscore`
final_df$`final_school_read$readscore` <- NULL
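Since the code above combines columns by position, it relies on both data frames having the same rows in the same order. One possible alternative, sketched here on small made-up data frames, is to join on id and grade first and fill the gaps afterwards. The names long, full and math_full are assumptions for illustration.

```r
# Long-form data with gaps
long <- data.frame(id = c(1, 1, 2, 2),
                   grade = c(4, 5, 4, 5),
                   mathscore = c(NA, 450, 430, NA))

# Complete data, deliberately in a different row order
full <- data.frame(id = c(2, 1, 1, 2),
                   grade = c(5, 4, 5, 4),
                   math_full = c(490, 410, 450, 430))

# Join by key so row order and row counts no longer matter
m <- merge(long, full, by = c("id", "grade"), all.x = TRUE)

# Fill missing scores from the complete data, then drop the helper column
m$mathscore <- ifelse(is.na(m$mathscore), m$math_full, m$mathscore)
m$math_full <- NULL
m <- m[order(m$id, m$grade), ]
m
```

With all.x = TRUE, rows of the long data that have no match in the complete data are kept and simply stay NA.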
