Rearranging Data - r

I am trying to rearrange a data set and then sort it on multiple variables. For example, right now I have something that looks like this:
ID Name Class 1 Class2 Monday 7-8 Monday 8-9
1 Brad Chem Bio Monday 7-8 NA
2 Charlene Acct NA NA Monday 8-9
3 Carly Philosophy Physics NA NA
4 Jess Chem Acct Monday 7-8 Monday 8-9
I want to rearrange and sort it like this:
Class Monday 7-8 Monday 8-9
Acct Jess Charlene, Jess
Bio Brad NA
Chem Brad, Jess Jess
Philosophy NA NA
Physics NA NA
I have tried separating all of the variables into different spreadsheets and then merging them, but I can't figure out how to sort the names by both class and time. The actual data set has about 70 different time options, 80 different people and 150 different class names (chem, bio, etc.), so I can't go in and build this by hand.

A tidyr solution:
library(dplyr)
library(tidyr)
df1 %>%
gather(class_col,Class,'Class.1','Class2') %>%
filter(!is.na(Class)) %>%
gather(date_col,date,'Monday.7.8','Monday.8.9') %>%
group_by(Class,date) %>%
summarize(Name = paste(Name,collapse=", ")) %>%
spread(date,Name) %>%
select(-`<NA>`)
# # A tibble: 5 x 3
# # Groups: Class [5]
# Class `Monday 7-8` `Monday 8-9`
# * <chr> <chr> <chr>
# 1 Acct Jess Charlene, Jess
# 2 Bio Brad <NA>
# 3 Chem Brad, Jess Jess
# 4 Philosophy <NA> <NA>
# 5 Physics <NA> <NA>

Here is some base R code for this task:
dat <- data.frame(
name=c("Brad", "Charlene", "Carly", "Jess"),
class1=c("Chem", "Acct", "Philosophy", "Chem"),
class2=c("Bio", NA, "Physics", "Acct"),
monday7.8=c("monday7.8", NA, NA, "monday7.8"),
monday8.9=c(NA, "monday8.9", NA, "monday8.9"),
stringsAsFactors=FALSE
)
classes <- c("Chem", "Acct", "Philosophy", "Physics", "Bio")
times <- c("monday7.8", "monday8.9")
ret <- expand.grid(class=classes, time=times, stringsAsFactors=FALSE)
one_alloc <- function(cl, tm, dat) {
idx <- which(!is.na(dat[,tm]) & (dat[,"class1"]==cl | dat[,"class2"]==cl))
if(length(idx)>0) return(paste(dat[idx,"name"], collapse=", ")) else return(NA)
}
one_alloc <- Vectorize(one_alloc, vectorize.args=c("cl", "tm"))
ret[,"names"] <- one_alloc(cl=ret[,"class"], tm=ret[,"time"], dat=dat)
ret <- reshape(ret, timevar="time", idvar="class", direction="wide")
ret
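With around 150 classes and 70 time slots, typing out the classes and times vectors by hand is impractical. The following is a minimal base R sketch that derives both from the data instead. It assumes the class columns can be picked out by the prefix "class" and the time columns by the prefix "monday"; those patterns, and the helper name names_for, are illustrative and would need adapting to the real column names.

```r
# Toy data mirroring the answer above.
dat <- data.frame(
  name      = c("Brad", "Charlene", "Carly", "Jess"),
  class1    = c("Chem", "Acct", "Philosophy", "Chem"),
  class2    = c("Bio", NA, "Physics", "Acct"),
  monday7.8 = c("monday7.8", NA, NA, "monday7.8"),
  monday8.9 = c(NA, "monday8.9", NA, "monday8.9"),
  stringsAsFactors = FALSE
)

# Assumption: column roles can be recognised from their name prefixes.
class_cols <- grep("^class", names(dat), value = TRUE)
time_cols  <- grep("^monday", names(dat), value = TRUE)

# Every distinct class name, sorted, with NA dropped.
all_classes <- unlist(dat[class_cols], use.names = FALSE)
classes <- sort(unique(all_classes[!is.na(all_classes)]))

# For one class and one time slot, paste the names of the matching people.
names_for <- function(cl, tm) {
  takes_class <- rowSums(dat[class_cols] == cl, na.rm = TRUE) > 0
  has_slot    <- !is.na(dat[[tm]])
  hits <- dat$name[takes_class & has_slot]
  if (length(hits) > 0) paste(hits, collapse = ", ") else NA_character_
}

# One row per class, one column per time slot.
ret <- data.frame(class = classes, stringsAsFactors = FALSE)
for (tm in time_cols) {
  ret[[tm]] <- vapply(classes, names_for, character(1), tm = tm,
                      USE.NAMES = FALSE)
}
ret
```

Because the class and time lists come from the data, adding a column or a new class name needs no code change as long as the prefix assumption holds.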

Related

How to do a frequency table where column values are variables?

I have a DF named JOB with 4 columns: PERSON_ID, JOB, FT (1 for full time, 2 for part time) and YEAR. Every person can have only 1 full time job per year in this DF; it is the full time job from which they got most of their income during the year.
DF
PERSON_ID JOB FT YEAR
1 Analyst 1 2018
1 Analyst 1 2019
1 Analyst 1 2020
2 Coach 1 2018
2 Coach 1 2019
2 Analyst 1 2020
3 Gardener 1 2020
4 Coach 1 2018
4 Coach 1 2019
4 Analyst 1 2020
4 Coach 2 2019
4 Gardener 2 2019
I want frequency counts that answer questions like the following:
What full time job changes occurred between 2019 and 2020?
I want to look only at rows where FT=1.
I want my end table to look like this
2019 2020 frequency
Analyst Analyst 1
Coach Analyst 2
NA Gardener 1
I want to be able to say that 2 people moved from a coaching job to an analyst job, 1 analyst did not change jobs, and 1 person entered the labour market as a gardener.
I tried to fiddle around with the table function but did not even get close to what I wanted. I could not get the YEAR values to become separate variables.
10 bonus points if it can be done in base R :)
Thank you for your help.
Not pretty, but it works:
# split df by year
df_2019 <- df[df$YEAR %in% c(2019) & df$FT == 1, ]
df_2020 <- df[df$YEAR %in% c(2020) & df$FT == 1, ]
# rename Job columns
df_2019$JOB_2019 <- df_2019$JOB
df_2020$JOB_2020 <- df_2020$JOB
# select needed columns
df_2019 <- df_2019[, c("PERSON_ID", "JOB_2019")]
df_2020 <- df_2020[, c("PERSON_ID", "JOB_2020")]
# merge dfs
df2 <- merge(df_2019, df_2020, by = "PERSON_ID", all = TRUE)
df2$frequency <- 1
df2$JOB_2019 <- addNA(df2$JOB_2019)
df2$JOB_2020 <- addNA(df2$JOB_2020)
# aggregate frequency
aggregate(frequency ~ JOB_2019 + JOB_2020, data = df2, FUN = sum, na.action=na.pass)
JOB_2019 JOB_2020 frequency
1 Analyst Analyst 1
2 Coach Analyst 2
3 <NA> Gardener 1
Not base R, but it works:
library(dplyr)
library(tidyr)
data %>%
filter(FT==1, YEAR %in% c(2019, 2020)) %>%
group_by(YEAR, JOB, PERSON_ID) %>%
tally() %>%
pivot_wider(names_from = YEAR, values_from = JOB) %>%
select(-PERSON_ID) %>%
group_by(`2019`, `2020`) %>%
summarise(n = n())
`2019` `2020` n
<chr> <chr> <int>
1 Analyst Analyst 1
2 Coach Analyst 2
3 NA Gardener 1
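Since the question mentions trying table, here is a hedged base R sketch that reaches the same counts with reshape() followed by table(). The data frame below is retyped from the question, and the column labels job_2019 and job_2020 are names chosen for this illustration.

```r
# Data retyped from the question.
df <- data.frame(
  PERSON_ID = c(1, 1, 1, 2, 2, 2, 3, 4, 4, 4, 4, 4),
  JOB = c("Analyst", "Analyst", "Analyst", "Coach", "Coach", "Analyst",
          "Gardener", "Coach", "Coach", "Analyst", "Coach", "Gardener"),
  FT = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2),
  YEAR = c(2018, 2019, 2020, 2018, 2019, 2020, 2020, 2018, 2019, 2020,
           2019, 2019),
  stringsAsFactors = FALSE
)

# Keep full time rows for the two years of interest.
ft <- df[df$FT == 1 & df$YEAR %in% c(2019, 2020), c("PERSON_ID", "JOB", "YEAR")]

# One row per person, one JOB column per year (JOB.2019, JOB.2020).
wide <- reshape(ft, idvar = "PERSON_ID", timevar = "YEAR", direction = "wide")

# Cross-tabulate the two years; useNA keeps people missing in one year.
tab <- table(job_2019 = wide$JOB.2019, job_2020 = wide$JOB.2020,
             useNA = "ifany")

# Long format, dropping combinations that never occur.
res <- as.data.frame(tab, stringsAsFactors = FALSE)
res <- res[res$Freq > 0, ]
res
```

The useNA = "ifany" argument is what lets the person who only appears in 2020 show up as an NA to Gardener transition instead of being dropped.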

How can I count a variable in R conditional on the value of another variable?

I want to count occurrences of a variable in a dataframe by another variable, conditional on the value of a third variable. Here is my data:
Name Store Purchase Date
John CVS Shampoo 1/1/2001
John CVS Toothpaste 1/1/2001
John Whole Foods Kombucha 1/1/2005
John Kroger Ice Cream 1/1/2002
Jane CVS Soap 1/1/2001
Jane Whole Foods Crackers 1/1/2004
For each purchase, I want a count of how many previous purchases were made by the specified person, and how many previous shopping trips, like this:
Name Store Purchase Date Prev_Purchase Prev_trip
John CVS Shampoo 1/1/2001 0 0
John CVS Toothpaste 1/1/2001 0 0
John Whole Foods Kombucha 1/1/2005 3 2
John Kroger Ice Cream 1/1/2002 2 1
Jane CVS Soap 1/1/2001 0 0
Jane Whole Foods Crackers 1/1/2004 1 1
If I wanted the total number of purchases/trips for each person, I would use count or tapply--is there a way to adapt these functions so that the outputs are conditional on a third variable (date)?
You can try this base R approach using ave:
transform(df,
Prev_Purchase = ave(as.numeric(as.Date(Date, "%d/%m/%Y")), Name, FUN = function(x) sapply(x, function(p) sum(p > x))),
Prev_trip = ave(as.numeric(as.Date(Date, "%d/%m/%Y")), Name, FUN = function(x) sapply(x, function(p) length(unique(x[p > x]))))
)
which gives
Name Store Purchase Date Prev_Purchase Prev_trip
1 John CVS Shampoo 1/1/2001 0 0
2 John CVS Toothpaste 1/1/2001 0 0
3 John Whole Foods Kombucha 1/1/2005 3 2
4 John Kroger Ice Cream 1/1/2002 2 1
5 Jane CVS Soap 1/1/2001 0 0
6 Jane Whole Foods Crackers 1/1/2004 1 1
Data
df <- structure(list(Name = c("John", "John", "John", "John", "Jane",
"Jane"), Store = c("CVS", "CVS", "Whole Foods", "Kroger", "CVS",
"Whole Foods"), Purchase = c("Shampoo", "Toothpaste", "Kombucha",
"Ice Cream", "Soap", "Crackers"), Date = c("1/1/2001", "1/1/2001",
"1/1/2005", "1/1/2002", "1/1/2001", "1/1/2004")), class = "data.frame", row.names = c(NA,
-6L))
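To see what the two anonymous functions inside ave do, it may help to run them on one person's dates in isolation. The sketch below uses John's four dates from the example, written in ISO format for brevity; the variable names are illustrative.

```r
# John's purchase dates, converted to numbers as in the ave() call.
x <- as.numeric(as.Date(c("2001-01-01", "2001-01-01",
                          "2005-01-01", "2002-01-01")))

# For each date p, count how many of the person's dates are strictly earlier.
prev_purchase <- sapply(x, function(p) sum(p > x))

# Same idea, but count distinct earlier dates (one trip per date).
prev_trip <- sapply(x, function(p) length(unique(x[p > x])))

prev_purchase
prev_trip
```

ave() simply applies these functions once per Name group and puts the results back in the original row order. Note that this treats each distinct date as one trip, which matches the expected output in the question.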
I think it should solve your problem. If your data is huge it's better if you optimize this code chunk.
# load packages
library(lubridate)
library(dplyr) # provides filter() and bind_rows() used below
# base function
AddInfo = function(name, date, df) {
prev_purchase = sum(df$Name == name & df$Date < date)
prev_trip = length(unique(filter(df, Name == name & Date < date)$Date))
data = data.frame(
Prev_purchase = prev_purchase,
Prev_trip = prev_trip
)
return(data)
}
# define data frame
df = data.frame(
Name = c(rep('John', 4), rep('Jane', 2)),
Store = c('CVS', 'CVS', 'Whole Foods', 'Kroger', 'CVS', 'Whole Foods'),
Purchase = c('Shampoo', 'Toothpaste', 'Kombucha', 'Ice Cream', 'Soap', 'Crackers'),
Date = c('1/1/2001', '1/1/2001', '1/1/2005', '1/1/2002', '1/1/2001', '1/1/2004')
)
# parse the date strings into Date objects (dmy() returns Date, not POSIXct)
df$Date = dmy(df$Date)
# apply function and bind the results
cols = mapply(AddInfo, df$Name, df$Date, MoreArgs = list(df), SIMPLIFY = FALSE)
cols = bind_rows(cols)
df = cbind(df, cols)
Here is the output:
Name Store Purchase Date Prev_purchase Prev_trip
1 John CVS Shampoo 1/1/2001 0 0
2 John CVS Toothpaste 1/1/2001 0 0
3 John Whole Foods Kombucha 1/1/2005 3 2
4 John Kroger Ice Cream 1/1/2002 2 1
5 Jane CVS Soap 1/1/2001 0 0
6 Jane Whole Foods Crackers 1/1/2004 1 1
We could also use outer
library(dplyr)
library(lubridate)
df %>%
mutate(Date = dmy(Date)) %>%
group_by(Name) %>%
mutate(Prev_Purchase = colSums(outer(Date, Date, FUN = "<")),
Prev_trip = colSums(outer(unique(Date), Date, FUN = "<")))
# A tibble: 6 x 6
# Groups: Name [2]
# Name Store Purchase Date Prev_Purchase Prev_trip
# <chr> <chr> <chr> <date> <dbl> <dbl>
#1 John CVS Shampoo 2001-01-01 0 0
#2 John CVS Toothpaste 2001-01-01 0 0
#3 John Whole Foods Kombucha 2005-01-01 3 2
#4 John Kroger Ice Cream 2002-01-01 2 1
#5 Jane CVS Soap 2001-01-01 0 0
#6 Jane Whole Foods Crackers 2004-01-01 1 1

Partial string matching between data frames with match() or similar to preserve match positions

Using the function match(), I want to perform partial string matching between two character vectors from different data frames.
The position of the matched value has to be preserved because it is later used to reference the neighbouring columns. I found that match() works best for that.
I can do exact string matching:
## exact string matching
name <- c("AAB", "AAC", "AAD","AAE")
meaning1 <- c('circular','parallel','perpendicular','none')
meaning2 <- c('surface','longitudinal','transverse','not detected')
meaning3 <- c('category 1','category 1','category 1','category 2')
referenceData <- data.frame(name, meaning1, meaning2, meaning3, stringsAsFactors = FALSE)
name2 <- c("AAB", "AAC", "AAD","AAE")
myData <- data.frame(name2, stringsAsFactors = FALSE)
> referenceData
name meaning1 meaning2 meaning3
1 AAB circular surface category 1
2 AAC parallel longitudinal category 1
3 AAD perpendicular transverse category 1
4 AAE none not detected category 2
> myData
name2
1 AAB
2 AAC
3 AAD
4 AAE
matched <- match(myData[ , 'name2'], referenceData[ ,'name'])
> matched
[1] 1 2 3 4
myData$newCol <- referenceData$meaning1[matched]
myData$newCol2 <- referenceData$meaning2[matched]
> myData
name2 newCol newCol2
1 AAB circular surface
2 AAC parallel longitudinal
3 AAD perpendicular transverse
4 AAE none not detected
However the real data has a small complication and can only be partially matched so my above method won't work:
name2 <- c("AAB Monday and Thursday", "AAC Saturday", "AAD Wednesday", "AAE Friday")
myData <- data.frame(name2, stringsAsFactors = FALSE)
> myData
name2
1 AAB Monday and Thursday
2 AAC Saturday
3 AAD Wednesday
4 AAE Friday
matched <- match(myData[ , 'name2'], referenceData[ ,'name'])
> matched
[1] NA NA NA NA
myData$newCol <- referenceData$meaning1[matched]
myData$newCol2 <- referenceData$meaning2[matched]
> myData
name2 newCol newCol2
1 AAB Monday and Thursday <NA> <NA>
2 AAC Saturday <NA> <NA>
3 AAD Wednesday <NA> <NA>
4 AAE Friday <NA> <NA>
Can match() be combined with regex somehow to do the partial matching?
EDIT
The reproducible example was oversimplified. A more representative content would be:
name2 <- c("AAB Monday and Thursday", "AAC Saturday", "AAD Wednesday", "AAE Friday","AAB Monday and Thursday","AAB Monday and Thursday")
myData <- data.frame(name2, stringsAsFactors = FALSE)
> myData
name2
1 AAB Monday and Thursday
2 AAC Saturday
3 AAD Wednesday
4 AAE Friday
5 AAB Monday and Thursday
6 AAB Monday and Thursday
You could use sapply and grep like this:
sapply(referenceData[, 'name'], grep, myData[, 'name2'])
Note that I inverted the order of the arguments. "AAB" as a regexp matches "AAB Monday and Thursday", but not vice versa.
Edit: given your edit, if you know you are always matching just the first three characters, you might try this simple approach (no partial matching necessary):
first3 <- substr(myData[ , 'name2'], 1, 3)
match(first3, referenceData[ ,'name'])
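If the codes are not always three characters long, a variant of the same idea is to strip everything after the first space and then call match() as before. This is a sketch under the assumption that the code is always the first space-delimited token in name2; the variable name code is illustrative.

```r
referenceData <- data.frame(
  name     = c("AAB", "AAC", "AAD", "AAE"),
  meaning1 = c("circular", "parallel", "perpendicular", "none"),
  meaning2 = c("surface", "longitudinal", "transverse", "not detected"),
  stringsAsFactors = FALSE
)
myData <- data.frame(
  name2 = c("AAB Monday and Thursday", "AAC Saturday", "AAD Wednesday",
            "AAE Friday", "AAB Monday and Thursday",
            "AAB Monday and Thursday"),
  stringsAsFactors = FALSE
)

# Assumption: the code is the first token, so drop the first space and
# everything that follows it.
code <- sub(" .*$", "", myData$name2)

# match() returns, for each row of myData, its row position in referenceData.
matched <- match(code, referenceData$name)

myData$newCol  <- referenceData$meaning1[matched]
myData$newCol2 <- referenceData$meaning2[matched]
myData
```

Unlike the sapply/grep approach, this keeps one position per row of myData, so repeated values such as the three "AAB ..." rows each get their own lookup.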

R Cleaning and reordering names/serial numbers in data frame

Let's say I have a data frame as follows in R:
Data <- data.frame("SerialNum" = character(), "Year" = integer(), "Name" = character(), stringsAsFactors = F)
Data[1,] <- c("983\n837\n424\n ", 2015, "Michael\nLewis\nPaul\n ")
Data[2,] <- c("123\n456\n789\n136", 2014, "Elaine\nJerry\nGeorge\nKramer")
Data[3,] <- c("987\n654\n321\n975\n ", 2010, "John\nPaul\nGeorge\nRingo\nNA")
Data[4,] <- c("424\n983\n837", 2015, "Paul\nMichael\nLewis")
Data[5,] <- c("456\n789\n123\n136", 2014, "Jerry\nGeorge\nElaine\nKramer")
What I want to do is the following:
Split up each string of names and each string of serial numbers so that they are their own vectors (or a list of string vectors).
Eliminate any character "NA" in either set of vectors or any blank spaces denoted by "...\n ".
Reorder each list of names alphabetically and reorder the corresponding serial numbers according to the same permutation.
Concatenate each vector in the same fashion it was originally (I usually do this with paste(., collapse = "\n")).
My issue is how to do this without using a for loop. What is an object-oriented way to do this? As a first attempt in this direction I originally made a list by the command LIST <- strsplit(Data$Name, split = "\n") and from here I need a for loop in order to find the permutations of the names, which seems like a process that won't scale according to my actual data. Additionally, once I make the list LIST I'm not sure how I go about removing NA symbols or blank spaces. Any help is appreciated!
Using lapply I take each row of the data frame and turn it into a new data frame with one name per row. This creates a list of 5 data frames, one for each row of the original data frame.
seinfeld = lapply(1:nrow(Data), function(i) {
# Turn strings into data frame with one name per row
dat = data.frame(SerialNum=unlist(strsplit(Data[i,"SerialNum"], split="\n")),
Year=Data[i,"Year"],
Name=unlist(strsplit(Data[i,"Name"], split="\n")))
# Get rid of empty strings and NA values
dat = dat[!(dat$Name %in% c(""," ","NA")), ]
# Order alphabetically
dat = dat[order(dat$Name), ]
})
UPDATE: Based on your comment, let me know if this is the result you're trying to achieve:
seinfeld = lapply(1:nrow(Data), function(i) {
# Turn strings into data frame with one name per row
dat = data.frame(SerialNum=unlist(strsplit(Data[i,"SerialNum"], split="\n")),
Name=unlist(strsplit(Data[i,"Name"], split="\n")))
# Get rid of empty strings and NA values
dat = dat[!(dat$Name %in% c(""," ","NA")), ]
# Order alphabetically
dat = dat[order(dat$Name), ]
# Collapse back into a single row with the new sort order
dat = data.frame(SerialNum=paste(dat[, "SerialNum"], collapse="\n"),
Year=Data[i, "Year"],
Name=paste(dat[, "Name"], collapse="\n"))
})
do.call(rbind, seinfeld)
SerialNum Year Name
1 837\n983\n424 2015 Lewis\nMichael\nPaul
2 123\n789\n456\n136 2014 Elaine\nGeorge\nJerry\nKramer
3 321\n987\n654\n975 2010 George\nJohn\nPaul\nRingo
4 837\n983\n424 2015 Lewis\nMichael\nPaul
5 123\n789\n456\n136 2014 Elaine\nGeorge\nJerry\nKramer
eipi10 offered a great answer. In addition to that, I'd like to leave what I tried, mainly with data.table. First, I split the two columns (SerialNum and Name) with cSplit(), added an index with add_rownames(), and split the data by that index. In the first lapply(), I used Stacked() from the splitstackshape package to stack SerialNum and Name; the separated SerialNum and Name values become two columns, as you can see in the part of temp2 shown below. In the second lapply(), I used merge() from the data.table package. Then I removed rows with NAs (lapply(na.omit)), combined all the data tables (rbindlist), and ordered the rows by rowname (the row number in the original data) and Name (setorder(rowname, Name)).
library(data.table)
library(splitstackshape)
library(dplyr)
cSplit(mydf, c("SerialNum", "Name"), direction = "wide",
type.convert = FALSE, sep = "\n") %>%
add_rownames %>%
split(f = .$rowname) -> temp
#a part of temp
#$`1`
#Source: local data frame [1 x 12]
#
#rowname Year SerialNum_1 SerialNum_2 SerialNum_3 SerialNum_4 SerialNum_5 Name_1 Name_2
#(chr) (dbl) (chr) (chr) (chr) (chr) (chr) (chr) (chr)
#1 1 2015 983 837 424 NA NA Michael Lewis
#Variables not shown: Name_3 (chr), Name_4 (chr), Name_5 (chr)
lapply(temp, function(x){
Stacked(x, var.stubs = c("SerialNum", "Name"), sep = "_")
}) -> temp2
# A part of temp2
#$`1`
#$`1`$SerialNum
# rowname Year .time_1 SerialNum
#1: 1 2015 1 983
#2: 1 2015 2 837
#3: 1 2015 3 424
#4: 1 2015 4 NA
#5: 1 2015 5 NA
#
#$`1`$Name
# rowname Year .time_1 Name
#1: 1 2015 1 Michael
#2: 1 2015 2 Lewis
#3: 1 2015 3 Paul
#4: 1 2015 4 NA
#5: 1 2015 5 NA
lapply(1:nrow(mydf), function(x){
merge(temp2[[x]]$SerialNum, temp2[[x]]$Name, by = c("rowname", "Year", ".time_1"))
}) %>%
lapply(na.omit) %>%
rbindlist %>%
setorder(rowname, Name) -> out
print(out)
# rowname Year .time_1 SerialNum Name
# 1: 1 2015 2 837 Lewis
# 2: 1 2015 1 983 Michael
# 3: 1 2015 3 424 Paul
# 4: 2 2014 1 123 Elaine
# 5: 2 2014 3 789 George
# 6: 2 2014 2 456 Jerry
# 7: 2 2014 4 136 Kramer
# 8: 3 2010 3 321 George
# 9: 3 2010 1 987 John
#10: 3 2010 2 654 Paul
#11: 3 2010 4 975 Ringo
#12: 4 2015 3 837 Lewis
#13: 4 2015 2 983 Michael
#14: 4 2015 1 424 Paul
#15: 5 2014 3 123 Elaine
#16: 5 2014 2 789 George
#17: 5 2014 1 456 Jerry
#18: 5 2014 4 136 Kramer
DATA
mydf <- structure(list(SerialNum = c("983\n837\n424\n ", "123\n456\n789\n136",
"987\n654\n321\n975\n ", "424\n983\n837", "456\n789\n123\n136"
), Year = c(2015, 2014, 2010, 2015, 2014), Name = c("Michael\nLewis\nPaul\n ",
"Elaine\nJerry\nGeorge\nKramer", "John\nPaul\nGeorge\nRingo\nNA",
"Paul\nMichael\nLewis", "Jerry\nGeorge\nElaine\nKramer")), .Names = c("SerialNum",
"Year", "Name"), row.names = c(NA, -5L), class = "data.frame")
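For comparison, the four steps listed in the question can also be sketched in base R with mapply(), which avoids an explicit for loop. The helper names split_clean and reorder_row are hypothetical, and the sketch assumes that after cleaning, each row has the same number of serial numbers and names.

```r
mydf <- data.frame(
  SerialNum = c("983\n837\n424\n ", "123\n456\n789\n136",
                "987\n654\n321\n975\n ", "424\n983\n837",
                "456\n789\n123\n136"),
  Year = c(2015, 2014, 2010, 2015, 2014),
  Name = c("Michael\nLewis\nPaul\n ", "Elaine\nJerry\nGeorge\nKramer",
           "John\nPaul\nGeorge\nRingo\nNA", "Paul\nMichael\nLewis",
           "Jerry\nGeorge\nElaine\nKramer"),
  stringsAsFactors = FALSE
)

# Split one string on newlines, trim whitespace, drop blanks and literal "NA".
split_clean <- function(s) {
  parts <- trimws(strsplit(s, "\n", fixed = TRUE)[[1]])
  parts[!(parts %in% c("", "NA"))]
}

# Sort the names alphabetically and apply the same permutation to the serials.
reorder_row <- function(serial, name) {
  s <- split_clean(serial)
  n <- split_clean(name)
  stopifnot(length(s) == length(n))  # assumption stated in the lead-in
  o <- order(n)
  c(SerialNum = paste(s[o], collapse = "\n"),
    Name      = paste(n[o], collapse = "\n"))
}

# mapply() returns a 2 x nrow character matrix; rows are SerialNum and Name.
fixed <- mapply(reorder_row, mydf$SerialNum, mydf$Name, USE.NAMES = FALSE)

out <- mydf
out$SerialNum <- fixed["SerialNum", ]
out$Name      <- fixed["Name", ]
out
```

The stopifnot() inside reorder_row fails fast if a row has mismatched counts, which seems safer than silently pairing the wrong serial number with a name.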

How do I infill non-adjacent rows with sample data from previous rows in R?

I have data containing a unique identifier, a category, and a description.
Below is a toy dataset.
prjnumber <- c(1,2,3,4,5,6,7,8,9,10)
category <- c("based","trill","lit","cold",NA,"epic", NA,NA,NA,NA)
description <- c("skip class",
"dunk on brayden",
"record deal",
"fame and fortune",
NA,
"female attention",
NA,NA,NA,NA)
toy.df <- data.frame(prjnumber, category, description)
> toy.df
prjnumber category description
1 1 based skip class
2 2 trill dunk on brayden
3 3 lit record deal
4 4 cold fame and fortune
5 5 <NA> <NA>
6 6 epic female attention
7 7 <NA> <NA>
8 8 <NA> <NA>
9 9 <NA> <NA>
10 10 <NA> <NA>
I want to randomly sample the 'category' and 'description' columns from rows that have been filled in to use as infill for rows with missing data.
The final data frame would be complete and would only rely on the initial 5 rows which contain data. The solution would preserve between-column correlation.
An expected output would be:
> toy.df
prjnumber category description
1 1 based skip class
2 2 trill dunk on brayden
3 3 lit record deal
4 4 cold fame and fortune
5 5 lit record deal
6 6 epic female attention
7 7 based skip class
8 8 based skip class
9 9 lit record deal
10 10 trill dunk on brayden
complete = na.omit(toy.df)
toy.df[is.na(toy.df$category), c("category", "description")] =
complete[sample(1:nrow(complete), size = sum(is.na(toy.df$category)), replace = TRUE),
c("category", "description")]
toy.df
# prjnumber category description
# 1 1 based skip class
# 2 2 trill dunk on brayden
# 3 3 lit record deal
# 4 4 cold fame and fortune
# 5 5 lit record deal
# 6 6 epic female attention
# 7 7 cold fame and fortune
# 8 8 based skip class
# 9 9 epic female attention
# 10 10 epic female attention
Though it would seem a little more straightforward if you didn't start with the unique identifiers filled out for the NA rows...
You could try
library(dplyr)
toy.df %>%
mutate_each(funs(replace(., is.na(.), sample(.[!is.na(.)]))), 2:3)
Based on new information, we may need a numeric index to use in the funs.
toy.df %>%
mutate(indx= replace(row_number(), is.na(category),
sample(row_number()[!is.na(category)], replace=TRUE))) %>%
mutate_each(funs(.[indx]), 2:3) %>%
select(-indx)
Using base R to fill in a single field at a time, use something like (not preserving the correlation between the fields):
fields <- c('category','description')
for(field in fields){
missings <- is.na(toy.df[[field]])
toy.df[[field]][missings] <- sample(toy.df[[field]][!missings],sum(missings),T)
}
and to fill them in simultaneously (preserving the correlation between the fields) use something like:
missings <- apply(toy.df[,fields],
1,
function(x)any(is.na(x)))
toy.df[missings,fields] <- toy.df[!missings,fields][sample(sum(!missings),
sum(missings),
T),]
and of course, to avoid the implicit for loop in the apply(x,1,fun), you could use:
rowAny <- function(x) rowSums(x) > 0
missings <- rowAny(is.na(toy.df[, fields]))
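Putting the pieces together, a reproducible version of the correlation-preserving infill might look like the sketch below. It samples donor row indices once and copies both fields from the same donor row. The seed value and the names donors and picked are illustrative choices.

```r
toy.df <- data.frame(
  prjnumber   = 1:10,
  category    = c("based", "trill", "lit", "cold", NA, "epic",
                  NA, NA, NA, NA),
  description = c("skip class", "dunk on brayden", "record deal",
                  "fame and fortune", NA, "female attention",
                  NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

set.seed(42)  # arbitrary seed so the sampled rows are repeatable
fields <- c("category", "description")

# Rows where at least one of the fields is missing.
missing_rows <- !complete.cases(toy.df[, fields])

# Indices of complete rows that may act as donors.
donors <- which(!missing_rows)

# Pick one donor per missing row, with replacement. Indexing through
# sample.int avoids the sample() surprise when only one donor exists.
picked <- donors[sample.int(length(donors), sum(missing_rows), replace = TRUE)]

# Copy both fields from the same donor row so the pairs stay intact.
toy.df[missing_rows, fields] <- toy.df[picked, fields]
toy.df
```

Sampling row indices rather than column values is what keeps each category paired with its original description.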
