Lookup and replace column values using dplyr in R

I found a few related discussions, but I couldn't find a proper solution within dplyr.
My main table has more than 50 columns, and I have 15 lookup tables, each with around 8-15 columns. I have multiple lookups to perform, and since it gets really messy with select statements (either selecting columns or removing them with a minus), I would like to be able to replace column values on the fly.
Is this possible using dplyr? I have provided sample data below for better understanding.
I would like to do a VLOOKUP (as in Excel): match city in table against lcity in lookup, and replace the values of city with newcity.
> table <- data.frame(name = c("a","b","c","d","e","f"), city = c("hyd","sbad","hyd","sbad","others","unknown"), rno = c(101,102,103,104,105,106),stringsAsFactors=FALSE)
> lookup <- data.frame(lcity = c("hyd","sbad","others","test"), newcity = c("nhyd","nsbad","nothers","ntest"), rating = c(10,20,40,55), newrating = c(100,200,400,550), stringsAsFactors = FALSE)
> table
  name    city rno
1    a     hyd 101
2    b    sbad 102
3    c     hyd 103
4    d    sbad 104
5    e  others 105
6    f unknown 106
> lookup
   lcity newcity rating newrating
1    hyd    nhyd     10       100
2   sbad   nsbad     20       200
3 others nothers     40       400
4   test   ntest     55       550
My output table should be
  name    city rno
1    a    nhyd 101
2    b   nsbad 102
3    c    nhyd 103
4    d   nsbad 104
5    e nothers 105
6    f    <NA> 106
I have tried the code below to update the values on the fly, but it assigns another data frame/table instead of a character vector:
table$city <- select(left_join(table,lookup,by=c("city"="lcity")),"newcity")
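For reference, select() always returns a data frame, never a bare vector, which is why this assignment misbehaves. Extracting the column directly yields the character vector the assignment expects, as a minimal fix of the attempt above:
# pull the looked-up column out as a plain character vector
table$city <- left_join(table, lookup, by = c("city" = "lcity"))$newcity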

One solution could be:
Note: the lookup data shown by the OP in tabular format and the data created by the commands differ; I have used the lookup data as shown by the OP in tabular format.
library(dplyr)

# Data from OP
table <- data.frame(name = c("a","b","c","d","e","f"),
                    city = c("hyd","sbad","hyd","sbad","others","unknown"),
                    rno = c(101,102,103,104,105,106), stringsAsFactors = FALSE)
lookup <- data.frame(lcity = c("hyd","sbad","others","test"),
                     newcity = c("nhyd","nsbad","nothers","ntest"),
                     rating = c(10,20,40,55), newrating = c(100,200,400,550),
                     stringsAsFactors = FALSE)

table %>%
  inner_join(lookup, by = c("city" = "lcity")) %>%
  mutate(city = newcity) %>%
  select(name, city, rno)
  name    city rno
1    a    nhyd 101
2    b   nsbad 102
3    c    nhyd 103
4    d   nsbad 104
5    e nothers 105
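Note that inner_join drops name "f", because "unknown" has no match in lookup. If the unmatched row should be kept with an NA city, as in the desired output above, the same pipeline with left_join should do it:

table %>%
  left_join(lookup, by = c("city" = "lcity")) %>%
  mutate(city = newcity) %>%
  select(name, city, rno)

  name    city rno
1    a    nhyd 101
2    b   nsbad 102
3    c    nhyd 103
4    d   nsbad 104
5    e nothers 105
6    f    <NA> 106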

Related

R: How do I create a new column using multiple conditions from a row to select data from another row?

I want to add a new column to a table for an "original" (untreated) measurement. I already have this original measurement in the table.
Currently I do this by creating a new table that contains only the untreated rows (coagulant == 'none'), and then I do a left_join between the original and the new table, matching on bacteria and plate number. This works and is okay because I'm using a small dataset, but I think it will become problematic with a larger dataset. Is there a way to do this without the join? I tried a couple of things, like a conditional mutate(), but I kept getting errors.
Example:
# note: value uses runif() with no seed, so the numbers shown below
# come from one particular run and will differ on re-run
waterData <- data.frame(
bacteria = c("a","a","a","b","b","b","c","c","c",
"a","a","a","b","b","b","c","c","c",
"a","a","a","b","b","b","c","c","c"),
coagulant = c("none","none","none","none","none","none","none","none","none",
"Al","Al","Al","Al","Al","Al","Al","Al","Al",
"Fe","Fe","Fe","Fe","Fe","Fe","Fe","Fe","Fe"),
plateNumber = c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3),
value = runif(n = 27)
)
waterData %>%
  filter(coagulant == 'none') %>%
  rename(untreatedValue = value) %>%
  select(bacteria, plateNumber, untreatedValue) -> untreatedData
cleanData <- left_join(waterData, untreatedData, by = c("bacteria", "plateNumber"))
This produces the correct output of
bacteria coagulant plateNumber value untreatedValue
1 a none 1 0.89144988 0.8914499
2 a none 2 0.70860682 0.7086068
3 a none 3 0.43159203 0.4315920
4 b none 1 0.45186377 0.4518638
5 b none 2 0.69247771 0.6924777
6 b none 3 0.96785414 0.9678541
7 c none 1 0.32297108 0.3229711
8 c none 2 0.62143845 0.6214385
9 c none 3 0.76141500 0.7614150
10 a Al 1 0.13803152 0.8914499
11 a Al 2 0.61881702 0.7086068
12 a Al 3 0.73701268 0.4315920
13 b Al 1 0.88616574 0.4518638
14 b Al 2 0.31901426 0.6924777
15 b Al 3 0.96804077 0.9678541
16 c Al 1 0.46672823 0.3229711
17 c Al 2 0.24288126 0.6214385
18 c Al 3 0.58132458 0.7614150
19 a Fe 1 0.39845872 0.8914499
20 a Fe 2 0.90278081 0.7086068
21 a Fe 3 0.40242276 0.4315920
22 b Fe 1 0.44009792 0.4518638
23 b Fe 2 0.92667612 0.6924777
24 b Fe 3 0.70042384 0.9678541
25 c Fe 1 0.37229116 0.3229711
26 c Fe 2 0.32212515 0.6214385
27 c Fe 3 0.04384053 0.7614150
I think something like this should do the trick:
waterData %>%
  group_by(bacteria, plateNumber) %>%
  mutate(untreatedValue = value[coagulant == 'none'])
Within each group, we take the value where coagulant is 'none'. This method will break if you have multiple 'none' coagulant values for a particular bacteria/plateNumber pair. I think that's more of a feature than a bug, though, since you would want it to break so that you can fix the unexpected duplication in your data. A non-breaking version would be something like this, which takes the mean of the trials:
waterData %>%
  group_by(bacteria, plateNumber) %>%
  mutate(untreatedValue = mean(value[coagulant == 'none']))
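As a quick sanity check (assuming both data frames are built as above; cleanData2 is a name chosen here for illustration), the grouped mutate reproduces the join result row for row:

library(dplyr)

cleanData2 <- waterData %>%
  group_by(bacteria, plateNumber) %>%
  mutate(untreatedValue = value[coagulant == 'none']) %>%
  ungroup()
# both approaches keep waterData's row order, so the columns line up
all.equal(cleanData$untreatedValue, cleanData2$untreatedValue)  # TRUE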

summarize categorical data based on grouping

I have a data frame in the following form
Id <- c(101,102,103,101,103,103,102,101,103,102)
Service <- c('A','B','A','C','A','A','B','C','A','B')
Type <- c('C','I','C','I','C','C','C','I','I','C')
Channel <- c('ATM1','ATM2','ATM1','Teller','Teller','ATM2','ATM1','ATM1','ATM2','Teller')
amount <- c(11,34,56,37,65,83,26,94,34,55)
df <- data.frame(Id,Service,Channel,Type,amount)
df in tabular format:
Id Service Channel Type amount
101 A ATM1 C 11
102 B ATM2 I 34
103 A ATM1 C 56
101 C Teller I 37
103 A Teller C 65
103 A ATM2 C 83
102 B ATM1 C 26
101 C ATM1 I 94
103 A ATM2 I 34
102 B Teller C 55
I am able to summarize my data using the amount column with df %>% group_by(Id) %>% summarise(total = sum(amount)) %>% as.data.frame:
Id total
101 142
102 115
103 238
How can I summarize the data in a similar way using the categorical columns (Service/Type/Channel) with group_by(Id)? I know we could use table() here, but I am trying to create a data frame that I can use for further analysis, such as clustering.
One way to restructure the categorical variables so that they can be summarized by Id is to create dummy-coded variables, where 1 means presence and 0 means absence. Then aggregating yields counts of each category (e.g., the number of times ATM1 was used) per Id.
We use the dummies package to create dummy coded variables.
Id <- c(101,102,103,101,103,103,102,101,103,102)
Service <- c('A','B','A','C','A','A','B','C','A','B')
Type <- c('C','I','C','I','C','C','C','I','I','C')
Channel <- c('ATM1','ATM2','ATM1','Teller','Teller','ATM2','ATM1','ATM1','ATM2','Teller')
amount <- c(11,34,56,37,65,83,26,94,34,55)
df <- data.frame(Id,Service,Channel,Type,amount)
library(dummies)
df <- dummy.data.frame(df,names=c("Service","Type","Channel"))
aggregate(. ~ Id,data=df,"sum")
...and the output:
> aggregate(. ~ Id,data=df,"sum")
Id ServiceA ServiceB ServiceC ChannelATM1 ChannelATM2 ChannelTeller TypeC
1 101 1 0 2 2 0 1 1
2 102 0 3 0 1 1 1 2
3 103 4 0 0 1 2 1 3
TypeI amount
1 2 142
2 1 115
3 1 238
We interpret the results as follows.
Id 101 used Service A once, Service C twice, ATM1 once, a Teller once, Type I once, and Type C twice for a total amount of 142.
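If the dummies package is not available, a rough dplyr/tidyr sketch of the same idea (the intermediate column names here are my own choices, not part of the original answer): count how often each level of each categorical column occurs per Id, then spread the counts into one column per level.

library(dplyr)
library(tidyr)

df %>%
  mutate(across(c(Service, Channel, Type), as.character)) %>%
  pivot_longer(c(Service, Channel, Type),
               names_to = "variable", values_to = "level") %>%
  count(Id, variable, level) %>%
  unite("column", variable, level, sep = "") %>%
  pivot_wider(names_from = column, values_from = n, values_fill = 0)

The amount totals can then be joined back on from the group_by(Id) %>% summarise(total = sum(amount)) result shown earlier.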

Taking cube root and log transformation in R

I have a table whose rows correspond to a set of people and their body mass estimates. For instance, say a matrix "mass estimate" with these values:
Name Mass
1 person_a 234
2 person_b 190
3 person_c 203
4 person_d 176
How can I, in a single line of R code, take the cube roots of the masses and then log-transform them?
I am not sure how to format the data above as a table, since it shows on a single line here. The first column reads "Name" and the second column reads "Mass". Each row has a name (person_a) and the mass (234).
Thanks!
# Sample matrix
mat <- matrix(runif(20), ncol = 5);
# log10-transform the cube root of all entries
mat.trans <- log10(mat^(1/3))
Or with your dataframe example (which is not the same as a matrix):
df <- read.table(text =
"Name Mass
1 person_a 234
2 person_b 190
3 person_c 203
4 person_d 176", sep = "");
# log10-transform the cube root
df$transMass <- log10(df$Mass^(1/3));
# Name Mass transMass
#1 person_a 234 0.7897386
#2 person_b 190 0.7595845
#3 person_c 203 0.7691653
#4 person_d 176 0.7485042
Assuming you have a data frame df with a variable named Mass, you can use this:
df$New<-log10(df$Mass^(1/3))
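Since log10(x^(1/3)) is just log10(x)/3 (log of a power), the same transform can also be written without the exponent:

df$New <- log10(df$Mass) / 3  # identical result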

Sort list on numeric values stored as factor

I have 4 data frames with data from different experiments, where each row represents a trial. The participant's ID (SID) is stored as a factor. Each one of the data frames looks like this:
Experiment 1:
SID trial measure
5402 1 0.6403791
5402 2 -1.8515095
5402 3 -4.8158912
25403 1 NA
25403 2 -3.9424822
25403 3 -2.2100059
I want to make a new data frame with the id's of the participants in each of the experiments, for example:
Exp1 Exp2 Exp3 Exp4
5402 22081 22160 25434
25403 22069 22179 25439
25485 22115 22141 25408
25457 22120 22185 25445
28041 22448 22239 25473
29514 22492 22291 25489
I want each column to be ordered as numbers, that is, 2 comes before 10.
I used unique() to extract the participant id's (SID) in each data frame, but I am having problems ordering the columns.
I tried using:
data.frame(order(unique(df1$SID)),
           order(unique(df2$SID)),
           order(unique(df3$SID)),
           order(unique(df4$SID)))
and I get (without the column names):
38 60 16 32 15
2 9 41 14 41
3 33 5 30 62
4 51 11 18 33
I'm sorry if I am missing something very basic; I am still very new to R.
Thank you for any help!
Edit:
I tried the solutions in the comments, and now I have:
x <- cbind(sort(as.numeric(unique(df1$SID)), decreasing = F),
           sort(as.numeric(unique(df2$SID)), decreasing = F),
           sort(as.numeric(unique(df3$SID)), decreasing = F),
           sort(as.numeric(unique(df4$SID)), decreasing = F))
Still does not work... I get:
V1 V2 V3 V4
1 8 6 5 2
2 9 35 11 3
3 10 37 17 184
4 13 38 91 185
5 15 39 103 186
The subject id's are 3 to 5 digit numbers...
If your data looks like this:
df <- read.table(text="
SID trial measure
5402 1 0.6403791
5402 2 -1.8515095
5402 3 -4.8158912
25403 1 NA
25403 2 -3.9424822
25403 3 -2.2100059",
header=TRUE, colClasses = c("factor","integer","numeric"))
I would do something like this:
df <- df[order(as.numeric(as.character(df$SID)), df$trial), ]  # sort df on SID (numeric) & trial
split(df$SID, df$trial) # breaks the vector SID into a list of vectors of SID for each trial
If you were worried about unique values you could do:
lapply(split(df$SID, df$trial), unique) # breaks SID into list of unique SIDs for each trial
That will give you a list of participant IDs for each trial, sorted by numeric value but maintaining their factor property.
If you really wanted a data frame, and the number of participants in each experiment were equal, you could use data.frame() on the list, as in: data.frame(split(df$SID, df$trial))
Suppose x and y represent the Exp1 SIDs and Exp2 SIDs. You can create an ordered list of unique values as shown below:
x<-factor(x = c(2,5,4,3,6,1,4,5,6,3,2,3))
y<-factor(x = c(2,3,4,2,4,1,4,5,5,3,2,3))
list(exp1=sort(x = unique(x),decreasing = F),y=sort(x = unique(y),decreasing = F))
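To build the Exp1..Exp4 data frame directly, a minimal sketch (assuming df1 through df4 exist and every experiment has the same number of participants, since data frame columns must have equal length). The as.character step matters: as.numeric on a factor returns the level codes, which is exactly the symptom in the question's edit.

ids <- list(df1$SID, df2$SID, df3$SID, df4$SID)
# factor -> character -> numeric, then sort each experiment's unique IDs
sortedIds <- lapply(ids, function(s) sort(unique(as.numeric(as.character(s)))))
result <- data.frame(Exp1 = sortedIds[[1]], Exp2 = sortedIds[[2]],
                     Exp3 = sortedIds[[3]], Exp4 = sortedIds[[4]])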

R - Create a new variable where each observation depends on another table and other variables in the data frame

I have the two following tables:
df <- data.frame(eth = c("A","B","B","A","C"),ZIP1 = c(1,1,2,3,5))
Inc <- data.frame(ZIP2 = c(1,2,3,4,5,6,7),A = c(56,98,43,4,90,19,59), B = c(49,10,69,30,10,4,95),C = c(69,2,59,8,17,84,30))
df:
eth ZIP1
  A    1
  B    1
  B    2
  A    3
  C    5

Inc:
ZIP2  A  B  C
   1 56 49 69
   2 98 10  2
   3 43 69 59
   4  4 30  8
   5 90 10 17
   6 19  4 84
   7 59 95 30
I would like to create a variable Inc in the df data frame where, for each observation, the value is taken from Inc at the intersection of that observation's eth (column) and ZIP (row). In my example, this would give:
eth ZIP1 Inc
A 1 56
B 1 49
B 2 10
A 3 43
C 5 17
A loop or fairly brute-force code could solve it, but that takes too long on my dataset; I'm looking for a more subtle way, maybe using data.table. It seems to me that this is a very standard question, and I apologize if it is; my inability to formulate a precise title for this problem (as you may have noticed) is perhaps why I haven't found any similar question in searching the forum.
Thanks!
Sure, it can be done in data.table:
library(data.table)
setDT(df)
df[melt(Inc, id.var = "ZIP2", variable.name = "eth", value.name = "Inc"),
   Inc := i.Inc,
   on = c(ZIP1 = "ZIP2", "eth")]
The syntax for this "merge-assign" operation is X[i, Xcol := expression, on=merge_cols].
You can run the i = melt(Inc, id.var="ZIP2", variable.name="eth", value.name="Inc") part on its own to see how it works. Inside the merge, columns from i can be referred to with i.* prefixes.
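After the merge-assign, df has gained the Inc column in place; printing it should show (the values follow from the sample data above):

df
#    eth ZIP1 Inc
# 1:   A    1  56
# 2:   B    1  49
# 3:   B    2  10
# 4:   A    3  43
# 5:   C    5  17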
Alternately...
setDT(df)
setDT(Inc)
df[, Inc := Inc[.(ZIP1), eth, on="ZIP2", with=FALSE], by=eth]
This is built on a similar idea. The package vignettes are a good place to start for this sort of syntax.
We can use row/column indexing: cbind() builds a two-column matrix of (row, column) positions, one pair per row of df, and indexing Inc with that matrix extracts the matching value for each pair.
df$Inc <- Inc[cbind(match(df$ZIP1, Inc$ZIP2), match(df$eth, colnames(Inc)))]
df
# eth ZIP1 Inc
#1 A 1 56
#2 B 1 49
#3 B 2 10
#4 A 3 43
#5 C 5 17
What about this?
library(reshape2)
merge(df, melt(Inc, id="ZIP2"), by.x = c("ZIP1", "eth"), by.y = c("ZIP2", "variable"))
ZIP1 eth value
1 1 A 56
2 1 B 49
3 2 B 10
4 3 A 43
5 5 C 17
Another option:
library(dplyr)
library(tidyr)
Inc %>%
gather(eth, value, -ZIP2) %>%
left_join(df, ., by = c("eth", "ZIP1" = "ZIP2"))
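This should give the same matches (up to a factor/character coercion warning on eth):

  eth ZIP1 value
1   A    1    56
2   B    1    49
3   B    2    10
4   A    3    43
5   C    5    17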
My solution (which may seem awkward): a plain loop over the rows.
for (i in seq_along(df$eth)) {
  # pick the column named after eth, then the row matching ZIP1
  # (this works here because Inc's rows are ordered so that row i has ZIP2 == i)
  df$Inc[i] <- Inc[as.character(df$eth[i])][df$ZIP1[i], ]
}
