How can I use R to fill rows based on column? - r

I have the following table
Code Name Class
1
2 Monday day
5 green color
9
6
1 red color
1
2
9 Tuesday day
6
5
Goal is to the fill the Name and Class columns based on the Code column of a filled row. For example, the second row is filled and the code is 2. I would like to fill all the rows where code = 2 with Name=Monday and Class=day.
I tried using fill() from tidyR but that seems to require ordered data.
structure(list(Code = c(1L, 2L, 5L, 9L, 6L, 1L, 1L, 2L, 9L, 6L,
5L), Name = structure(c(1L, 3L, 2L, 1L, 1L, 4L, 1L, 1L, 5L, 1L,
1L), .Label = c("", "green", "Monday", "red", "Tuesday"), class = "factor"),
Class = structure(c(1L, 3L, 2L, 1L, 1L, 2L, 1L, 1L, 3L, 1L,
1L), .Label = c("", "color", "day"), class = "factor")), .Names = c("Code",
"Name", "Class"), class = "data.frame", row.names = c(NA, -11L
))

library(dplyr)
final_df <- left_join(df, df[df$Name!='',], by='Code')[,c(1,4:5)]
colnames(final_df) <- colnames(df)
final_df

Related

An R function to work out the preceding value

I'm trying to create a table of staff, who they report to and what level they are.
I've been working with a similar table, and #TonakShah was kind enough to help me with calculating the lowest level location is and the level above is using the solution below.
My employee table looks like this:
input = structure(list(Level.1 = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = "Board", class = "factor"), Level.2 = structure(c(2L,
2L, 2L, 1L, 1L, 3L, 3L), .Label = c("Aasha", "Grace", "Marisol"
), class = "factor"), Level.3 = structure(c(4L, 4L, 3L, 1L, 1L,
2L, 2L), .Label = c("Alex", "Chandler", "Millie", "Tushad"), class = "factor"),
Level.4 = structure(c(2L, 2L, 6L, 1L, 5L, 3L, 4L), .Label = c("#",
"Frank", "Joey", "Rachel", "Sarah", "Tony"), class = "factor"),
Level.5 = structure(c(3L, 2L, 1L, 1L, 1L, 4L, 1L), .Label = c("#",
"Lela", "Millie", "Ross"), class = "factor"), Level.6 = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L), .Label = "#", class = "factor")), class = "data.frame", row.names = c(NA,
-7L))
and using the technique described here by Ronak (stackoverflow.com/questions/56903188/create-a-table-from-a-hierarchy/)
which is,
as.data.frame(t(apply(input, 1, function(x)
{new_x = x[x != "###"]; c(rev(tail(new_x, 2)), length(new_x)) })))
I can get most of the required table. But I'm having trouble trying to get "the bosses" (eg. those with employees but are not "the board").
My ideal output would look something like this (I've added colnames to make it easier to understand):
structure(list(Subordinate = structure(c(9L, 4L, 14L, 5L, 7L,
13L, 9L, 2L, 1L, 12L, 11L, 6L, 3L, 8L, 10L), .Label = c("Aasha",
"Alex", "Chandler", "Frank", "Grace", "Joey", "Lela", "Marisol",
"Millie", "Rachel", "Ross", "Sarah", "Tony", "Tushad"), class = "factor"),
Boss = structure(c(5L, 10L, 6L, 3L, 5L, 9L, 6L, 1L, 3L, 2L,
7L, 4L, 8L, 3L, 4L), .Label = c("Aasha", "Alex", "Board",
"Chandler", "Frank", "Grace", "Joey", "Marisol", "Millie",
"Tushad"), class = "factor"), Level = c(5L, 4L, 3L, 2L, 5L,
4L, 3L, 3L, 2L, 4L, 5L, 4L, 3L, 2L, 4L)), class = "data.frame", row.names = c(NA,
-15L))
I think I maybe do it with a loop, but this doesn't seem to be the best answer. Can anyone offer any other tips?
Couldn't come up with a prettier solution but this works. Using a while loop in the apply call used previously, we can do
output <- do.call(rbind.data.frame, apply(input, 1, function(x) {
new_x = as.character(x[x != "#"])
list_df <- list()
i = 1
while(length(new_x) >= 2) {
#Get last 2 eneteries
list_df[[i]] <- c(rev(tail(new_x, 2)), length(new_x))
#Go one level deeper
new_x = head(new_x, -1)
i = i +1
}
do.call(rbind, list_df)
}))
#To remove duplicate enteries
output[!duplicated(output), ]
# V1 V2 V3
#1 Millie Frank 5
#2 Frank Tushad 4
#3 Tushad Grace 3
#4 Grace Board 2
#5 Lela Frank 5
#9 Tony Millie 4
#10 Millie Grace 3
#12 Alex Aasha 3
#13 Aasha Board 2
#14 Sarah Alex 4
#17 Ross Joey 5
#18 Joey Chandler 4
#19 Chandler Marisol 3
#20 Marisol Board 2
#21 Rachel Chandler 4

Taking the frequency of three different columns [duplicate]

This question already has answers here:
Count number of rows within each group
(17 answers)
Closed 5 years ago.
I have a dataframe like this:
df <- structure(list(col1 = structure(c(1L, 1L, 2L, 3L, 1L, 3L, 1L,
3L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 4L), .Label = c("stock1",
"stock2", "stock3", "stock4"), class = "factor"), col2 = structure(c(4L,
5L, 7L, 6L, 5L, 5L, 5L, 6L, 6L, 8L, 8L, 4L, 3L, 3L, 1L, 2L, 3L
), .Label = c("comapny1", "comapny1+comapny4", "comapny4", "company1",
"company2", "company2+company1", "company3", "company4"), class = "factor"),
col3 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L), .Label = c("predictor1", "predictor2"
), class = "factor")), .Names = c("col1", "col2", "col3"), class = "data.frame", row.names = c(NA,
-17L))
I would like to take the frequency from the three columns.
Expected output
df2 <- structure(list(col1 = structure(c(1L, 1L, 1L, 2L, 4L, 1L, 1L,
3L, 3L, 1L, 2L, 1L), .Label = c("stock1", "stock2", "stock3",
"stock4"), class = "factor"), col2 = structure(c(1L, 2L, 3L,
3L, 3L, 4L, 5L, 5L, 6L, 6L, 7L, 8L), .Label = c("comapany1",
"comapany1+comapany4", "comapany4", "company1", "company2", "company2+company1",
"company3", "company4"), class = "factor"), col3 = structure(c(2L,
2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("predictor1",
"predictor2"), class = "factor"), frequency = c(1L, 1L, 1L, 1L,
1L, 2L, 3L, 1L, 2L, 1L, 1L, 2L)), .Names = c("col1", "col2",
"col3", "frequency"), class = "data.frame", row.names = c(NA,
-12L))
How is it possible to make it?
We can use count
library(dplyr)
count(df, col1, col2, col3)
# A tibble: 12 x 4
# col1 col2 col3 n
# <fctr> <fctr> <fctr> <int>
# 1 stock1 comapny1 predictor2 1
# 2 stock1 comapny1+comapny4 predictor2 1
# 3 stock1 comapny4 predictor2 1
# 4 stock1 company1 predictor1 2
# 5 stock1 company2 predictor1 3
# 6 stock1 company2+company1 predictor1 1
# 7 stock1 company4 predictor1 2
# 8 stock2 comapny4 predictor2 1
# 9 stock2 company3 predictor1 1
#10 stock3 company2 predictor1 1
#11 stock3 company2+company1 predictor1 2
#12 stock4 comapny4 predictor2 1
Or with data.table
library(data.table)
setDT(df)[, .N, .(col1, col2, col3)]

Summarise/reshape data in R

For an example dataframe:
df <- structure(list(id = 1:18, region = structure(c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("a",
"b"), class = "factor"), age.cat = structure(c(1L, 1L, 2L, 2L,
2L, 3L, 3L, 4L, 1L, 1L, 1L, 1L, 2L, 2L, 3L, 4L, 4L, 4L), .Label = c("0-18",
"19-35", "36-50", "50+"), class = "factor")), .Names = c("id",
"region", "age.cat"), class = "data.frame", row.names = c(NA,
-18L))
I want to reshape the data, as detailed below:
region 0-18 19-35 36-50 50+
a 2 3 2 1
b 4 2 1 3
Do I simply aggregate or reshape the data? Any help would be much appreciated.
You can do it just using table:
table(df$region, df$age.cat)
0-18 19-35 36-50 50+
a 2 3 2 1
b 4 2 1 3
Using reshape2:
install.packages('reshape2')
library(reshape2)
df1 <- melt(df, measure.vars = 'age.cat')
df1 <- dcast(df1, region ~ value)

Indicator feature creation in R based on multiple columns

I have a dataset with 10 columns and out of them 10, 3 are of interest to create a new indicator feature. The features are "pT", "pN", & "M" and they all take different values. Off all the values that these 3 features take, there are a toal of 9 unique combinations that needs to be captures in the new variable.
PATHOT PATHON PATHOM
1 pT2 pN1 M0
4 pT1 pN1 M0
13 pT3 pN1 M0
161 pT1 *pN2 M0
391 pT1 pN1 *M1
810 *pTIS pN1 M0
948 pT3 *pN2 M0
1043 pT2 pN1 *M1
1067 *pT4 pN1 M0
For example, the new variable will have value "1" when PATHOT=pT2, PATHON=pN1 & PATHOM=M0 and so on upto value 9. I have completed the task but after spending almost 20 lines of code involving vectorised operation for all unique combinations.
diag3_bs$sfd[diag3_bs$pathot=="pT2" & diag3_bs$pathon=="pN1" &
diag3_bs$pathom=="M0"] <- 1
diag3_bs$sfd[diag3_bs$pathot=="pT1" & diag3_bs$pathon=="pN1" &
diag3_bs$pathom=="M0"] <- 2
diag3_bs$sfd[diag3_bs$pathot=="pT3" & diag3_bs$pathon=="pN1" &
diag3_bs$pathom=="M0"] <- 3... so on upto 9.
I want to ask if there is a better more automated way of getting the same result?
dput(data.frame) is given below
structure(list(F_STATUS = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L), .Label = "Y", class = "factor"), EVENT_ID = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "BASELINE", class =
"factor"),
PAG_NAME = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = "BR2", class = "factor"), PTSIZE = c(3, 4,
2.7, 2, 0.9, 3, 3, 0.9, 3, 4.5), PTSIZE_U = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "CM", class = "factor"),
PT_SYM = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = c("", "-", "<", ">"), class = "factor"), PATHOT = structure(c(4L,
4L, 4L, 3L, 3L, 4L, 4L, 3L, 4L, 4L), .Label = c("*pT4", "*pTIS",
"pT1", "pT2", "pT3"), class = "factor"), PATHON = structure(c(2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("*pN2", "pN1"
), class = "factor"), PATHOM = structure(c(2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L), .Label = c("*M1", "M0"), class = "factor"),
RSUBJID = 901000:901009, RUSUBJID = structure(1:10, .Label = c(
"000301-000-901-251", "000301-000-901-252", "000301-000-901-253",
"000301-000-901-254", "000301-000-901-255", "000301-000-901-256",
"000301-000-901-257", "000301-000-901-258", "000301-000-901-259",
"000301-000-901-260", "000301-000-901-261", "000301-000-901-262")
, class = "factor")), .Names = c("F_STATUS", "EVENT_ID", "PAG_NAME", "PTSIZE", "PTSIZE_U", "PT_SYM", "PATHOT",
"PATHON", "PATHOM", "RSUBJID", "RUSUBJID"), row.names = c(NA, 10L),
class = "data.frame")
Thanks.
I tried to edit the data so it didn't throw an error on input. Also created a version of that tabulation of possible combinations:
stg_tbl <- structure(list(PATHOT = structure(c(4L, 3L, 5L, 3L, 3L, 2L, 5L,
4L, 1L), .Label = c("*pT4", "*pTIS", "pT1", "pT2", "pT3"), class = "factor"),
PATHON = structure(c(2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 2L), .Label = c("*pN2",
"pN1"), class = "factor"), PATHOM = structure(c(2L, 2L, 2L,
2L, 1L, 2L, 2L, 1L, 2L), .Label = c("*M1", "M0"), class = "factor")), .Names = c("PATHOT",
"PATHON", "PATHOM"), class = "data.frame", row.names = c("1",
"4", "13", "161", "391", "810", "948", "1043", "1067"))
Make a vector of text-equivalents of the categories:
stg_lbls <- with(stg_tbl, paste(PATHOT, PATHON, PATHOM, sep="_") )
Then the as.numeric values of a factor created using those levels will be the desired result:
dat$stg <- with(dat, factor( paste(PATHOT, PATHON, PATHOM, sep="_"), levels=stg_lbls))
as.numeric(dat$stg)
#[1] 1 1 1 2 2 1 1 2 1 1
You can just assign those values in the usual way:
dat$sfd <- as.numeric(dat$stg)
I made some new data, that should be useful for your problem.
k<-expand.grid(data.frame(a=letters[1:3],b=letters[4:6],c=letters[7:9]))
library(dplyr)
k %>% mutate(groups=paste0(a,b,c))->k2
k2$groups<-as.numeric(factor(k2$groups))
k2
It's crude, and you're not picking which combination get's which numbers, so it'd take some digging afterwards, but it's quick.

Bin data by (x,y) and summarize

These are the first 10 lines of a huge files I have: (Note that there is only one user in these 10 lines but I've got thousands of users)
dput(testd)
structure(list(user = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L
), otime = structure(c(10L, 9L, 8L, 7L, 6L, 5L, 4L, 3L, 2L, 1L
), .Label = c("2010-10-12T19:56:49Z", "2010-10-13T03:57:23Z",
"2010-10-13T16:41:35Z", "2010-10-13T20:05:43Z", "2010-10-13T23:31:51Z",
"2010-10-14T00:21:47Z", "2010-10-14T18:25:51Z", "2010-10-16T03:48:54Z",
"2010-10-16T06:02:04Z", "2010-10-17T01:48:53Z"), class = "factor"),
lat = c(39.747652, 39.891383, 39.891077, 39.750469, 39.752713,
39.752508, 39.7513, 39.758974, 39.827022, 39.749934),
long = c(-104.99251, -105.070814, -105.068532, -104.999073,
-104.996337, -104.996637, -105.000121, -105.010853,
-105.143191, -105.000017),
locid = structure(c(5L, 4L, 9L, 6L, 1L, 2L, 8L, 3L, 10L, 7L),
.Label = c("2ef143e12038c870038df53e0478cefc",
"424eb3dd143292f9e013efa00486c907", "6f5b96170b7744af3c7577fa35ed0b8f",
"7a0f88982aa015062b95e3b4843f9ca2", "88c46bf20db295831bd2d1718ad7e6f5",
"9848afcc62e500a01cf6fbf24b797732f8963683", "b3d356765cc8a4aa7ac5cd18caafd393",
"d268093afe06bd7d37d91c4d436e0c40d217b20a", "dd7cd3d264c2d063832db506fba8bf79",
"f6f52a75fd80e27e3770cd3a87054f27"), class = "factor"),
dnt = structure(c(10L, 9L, 8L, 7L, 6L, 5L, 4L, 3L, 2L, 1L),
.Label = c("2010-10-12 19:56:49",
"2010-10-13 03:57:23", "2010-10-13 16:41:35", "2010-10-13 20:05:43",
"2010-10-13 23:31:51", "2010-10-14 00:21:47", "2010-10-14 18:25:51",
"2010-10-16 03:48:54", "2010-10-16 06:02:04", "2010-10-17 01:48:53"
), class = "factor"),
x = c(-11674.6344476781, -11683.3414552141,
-11683.0877083915, -11675.3642199817, -11675.0599906624,
-11675.0933491404, -11675.4807522648, -11676.6740962175,
-11691.3894104198, -11675.4691879924),
y = c(4419.73724843345, 4435.719406435, 4435.68538078744,
4420.05048454181, 4420.3000059572, 4420.27721099723,
4420.14288752585, 4420.99619739292, 4428.56278976123,
4419.99099525605),
cellx = structure(c(1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 1L),
.Label = c("[-11682,-11672)", "[-11692,-11682)"
), class = "factor"),
celly = structure(c(1L, 2L, 2L, 1L,
1L, 1L, 1L, 1L, 1L, 1L), .Label = c("[4419,4429)", "[4429,4439)"
), class = "factor"),
cellxy = structure(c(1L, 3L, 3L, 1L,
1L, 1L, 1L, 1L, 2L, 1L), .Label = c("[-11682,-11672)[4419,4429)",
"[-11692,-11682)[4419,4429)", "[-11692,-11682)[4429,4439)"
), class = "factor")), .Names = c("user", "otime", "lat",
"long", "locid", "dnt", "x", "y", "cellx", "celly", "cellxy"), class = "data.frame", row.names = c(NA,
-10L))
A bit of explanation on what the data is to simplify understanding. The x and y are transformation of the lat and long coordinates. I have discretised the x,y locations into bins using cut. I want to get the most visited bin per user so I use ddply. As follows:
cells = ddply(testd, .(user, cellxy), summarise, length(cellxy))
Obtaining:
dput(cells)
structure(list(user = c(0, 0, 0), cellxy = structure(1:3, .Label = c("[-11682,-11672)[4419,4429)",
"[-11692,-11682)[4419,4429)", "[-11692,-11682)[4429,4439)"), class = "factor"),
count = c(7L, 1L, 2L)), .Names = c("user", "cellxy", "count"
), row.names = c(NA, -3L), class = "data.frame")
Now what I want to do is calculate the average x,y from the first dataset for the most visited bin per user as obtained from the previous calculation. I have no idea how to do this efficiently and given that my dataset is really big I would appreciate some guidance. Thanks!
Here is two stage approach. First, modified your original code of cells - for each combination of cellxy and user calculate mean x and y value.
cells = ddply(testd, .(user, cellxy), summarise,
cellcount=length(cellxy),meanx=mean(x),meany=mean(y))
cells
user cellxy cellcount meanx meany
1 0 [-11682,-11672)[4419,4429) 7 -11675.40 4420.214
2 0 [-11692,-11682)[4419,4429) 1 -11691.39 4428.563
3 0 [-11692,-11682)[4429,4439) 2 -11683.21 4435.702
Then use other call to ddply() to subset for each user cellxy with highest cellcount.
cells2 = ddply(cells,.(user),subset,cellcount==max(cellcount))
cells2
user cellxy cellcount meanx meany
1 0 [-11682,-11672)[4419,4429) 7 -11675.4 4420.214
since your data set is large, you might want to consider data.table, which not only will be blazing fast, it will also make the data mungling a bit easier.
Converting to a data table is straight forward:
library (data.table)
DT <- data.table(testd, by="user")
Then determining the most visited, by user, is just one line
# Determining which is the most visited, by user
DT[, "MostVisited" := {counts <- table(cellxy); names(counts)[which(counts==max(counts))]}, by=user]
I'm not sure how specifically you want to calculate the average x, y relative to the MostVisited, but I'm sure that as well could be relatively straight forward with data.table.
## But perhaps something like this
DT[, c("AvgX", "AvgY") := list(mean(x), mean(y)), by=list(user, MostVisited)]

Resources