R Compare Columns across Dataframes to Match Values - r

I have two dataframes, looking at houses (n=6) and certain dates (n=22).
ORIGINAL is the original dataset. It contains 38 observations on 5 variables. Not all houses have all the dates listed, and vice versa, leading to errors in calculations with different length variables.
SAMPLE is a new empty dataset. It contains 132 (6 x 22) observations on the same 5 variables. Now there is an observation for every household for every date.
House Day Mongoose Fruit Elephant
A 1 40 7 0.6
A 6 32 12 4.2
B 2 50 3 4.0
B 4 51 4 8.6
B 6 8 7 12.1
C 2 12 8 13.0
I am trying to fill in the rest of SAMPLE by asking R to compare HouseID and Date between the two dataframes; if they match, the rest of the variables (mongoose, fruit, elephant) should be copied over for that observation.
I tried this to no avail...
for(i in 1:nrow(original))
{
if ((sample$Day == original$Day) && (sample$House == original$House))
{
sample$Mongoose[i] <- original$Mongoose[i]
sample$Fruit[i] <- original$Fruit[i]
sample$Elephant[i] <- original$Elephant[i]
}
}
The following results:
I get the following 3 errors in sequence
In sample$Day == test$Day : longer object length is not a multiple of
shorter object length
In is.na(e1) | is.na(e2) :longer object length is not a multiple of
shorter object length
In ==.default(sample$House, test$House) :longer object length is
not a multiple of shorter object length
The data DOES copy over, but incorrectly. All the values get transferred to the A house and sequential date, rather than the appropriate house and date.
I.e., it looks like this
House Day Mongoose Fruit Elephant
A 1 40 7 0.6
A 2 50 3 4.0
A 3 51 4 8.6
A 4 8 7 12.1
A 5 12 8 13.0
A 6 32 12 4.2
B 1
B 2
B 3 [...]
When it should (in essence) look like this:
House Day Mongoose Fruit Elephant
A 1 40 7 0.6
A 2
A 3
A 4
A 5
A 6 32 12 4.2 [rest of A houses have no data]
B 1
B 2 50 3 4.0
B 3
B 4 51 4 8.6
B 5
B 6 8 7 12.1 [rest of B houses have no data]
C 1
C 2 12 8 13.0
Please advise; I will eventually have to extend this technique to look at a sample dataset with 198K entries, and a test dataset with 115K.
Thanks!

Sounds to me like this should work:
merge(sample, original, by = c("House", "Day"), all.x = TRUE)
But hard to tell without a reproducible example. You may also want to look into dplyr::left_join(). That is, assuming your data looks like the following:
sample <- data.frame(House = rep(c("A", "B", "C"), each = 6),
Day = rep(1:6, 3))
# head(sample)
# House Day
# 1 A 1
# 2 A 2
# 3 A 3
# 4 A 4
# 5 A 5
# 6 A 6
original <- data.frame(House = c("A", "A", "B", "B", "C"),
Day = c(1, 6, 2, 4, 2),
Mongoose = c(40, 32, 50, 51, 8),
Fruit = c(7, 12, 3, 4, 8),
Elephant = c(0.6, 4.2, 4.0, 8.6, 12.1))
# head(original)
# House Day Mongoose Fruit Elephant
# 1 A 1 40 7 0.6
# 2 A 6 32 12 4.2
# 3 B 2 50 3 4.0
# 4 B 4 51 4 8.6
# 5 C 2 8 8 12.1
We obtain:
# head(merge(sample, original, by = c("House", "Day"), all.x = TRUE))
# House Day Mongoose Fruit Elephant
# 1 A 1 40 7 0.6
# 2 A 2 NA NA NA
# 3 A 3 NA NA NA
# 4 A 4 NA NA NA
# 5 A 5 NA NA NA
# 6 A 6 32 12 4.2

It could be a small tweak, look at this this line of your original code:
if ((sample$Day == original$Day) && (sample$House == original$House))
See if you can change it to this:
if ((sample$Day[i] == original$Day[i]) && (sample$House[i] == original$House[i]))
Because:
You are using a for loop with an i variable,
which you do very well with lines such as sample$Mongoose[i] <- original$Mongoose[i]
but in your example it seems the if statement is not actually making use of the i variable
so we revise it to make use of i so it will be comparing specifically that observation/rows's sample$Day with that observation/rows's original$Day, and the same for sample$House vs original$House

Related

How can I insert blank rows every 3 existing rows in a data frame?

How can I insert blank rows every 3 existing rows in a data frame?
After a web scraping process I get a dataframe with the information I need, however the final excel format requires that I add a blank row every 3 rows. I have searched the web for help but have not found a solution yet.
With hypothetical data, the structure of my data frame is as follows:
mi_df <- data.frame(
"ID" = rep(1:3,c(3,3,3)),
"X" = as.character(c("a", "a", "a", "b", "b", "b", "c", "c", "c")),
"Y" = seq(1,18, by=2)
)
mi_df
ID X Y
1 1 a 1
2 1 a 3
3 1 a 5
4 2 b 7
5 2 b 9
6 2 b 11
7 3 c 13
8 3 c 15
9 3 c 17
The result I hope for is something like this
ID X Y
1 1 a 1
2 1 a 3
3 1 a 5
4
5 2 b 7
6 2 b 9
7 2 b 11
8
9 3 c 13
10 3 c 15
11 3 c 17
If the indices of a data frame contain NA, then the output will have NA rows. So my goal is to create a vector like 1 2 3 NA 4 5 6 NA ... and set it as the indices of mi_df.
cut <- rep(1:(nrow(mi_df)/3), each = 3)
mi_df[sapply(split(1:nrow(mi_df), cut), c, NA), ]
# ID X Y
# 1 1 a 1
# 2 1 a 3
# 3 1 a 5
# NA NA <NA> NA
# 4 2 b 7
# 5 2 b 9
# 6 2 b 11
# NA.1 NA <NA> NA
# 7 3 c 13
# 8 3 c 15
# 9 3 c 17
# NA.2 NA <NA> NA
If nrow(mi_df) is not a multiple of 3, then the following is a general solution:
# Version 1
cut <- rep(1:ceiling(nrow(mi_df)/3), each = 3, len = nrow(mi_df))
mi_df[Reduce(c, lapply(split(1:nrow(mi_df), cut), c, NA)), ]
# Version 2
cut <- rep(1:ceiling(nrow(mi_df)/3), each = 3, len = nrow(mi_df))
mi_df[Reduce(function(x, y) c(x, NA, y), split(1:nrow(mi_df), cut)), ]
Don't mind the NA in the output because some functions which write data to an excel file have an optional argument controls if NA values are converted to strings or be empty. E.g.
library(openxlsx)
write.xlsx(df, "test.xlsx", keepNA = FALSE) # defaults to FALSE
tmp <- split(mi_df, rep(1:(nrow(mi_df) / 3), each = 3))
# or split(mi_df, ggplot2::cut_width(seq_len(nrow(mi_df)), 3, center = 2))
do.call(rbind, lapply(tmp, function(x) { x[4, ] <- NA; x }))
ID X Y
1.1 1 a 1
1.2 1 a 3
1.3 1 a 5
1.4 NA <NA> NA
2.4 2 b 7
2.5 2 b 9
2.6 2 b 11
2.4.1 NA <NA> NA
3.7 3 c 13
3.8 3 c 15
3.9 3 c 17
3.4 NA <NA> NA
You can make empty rows like you show by assigning an empty character vector ("") instead of NA, but this will convert your columns to character, and I wouldn't recommend it.
My recommendation is somewhat different from all the other answers: don't make a mess of your dataset inside R . Use the existing packages to write to designated rows in an Excel workbook. For example, with the package xlConnect, the method writeWorksheet (called from writeWorksheetToFile ) includes these arguments:
object The workbook to write to data Data to write
sheet The name or index of the sheet to write to
startRow Index of the first row to write to. The default is startRow = 1.
startCol Index of the first column to write to. The default is startCol = 1.
So if you simply set up a loop that writes 3 rows of your data file at a time, then moves the row index down by 4 and writes the next 3 rows, etc., you're all set.
Here's one method.
Splits into list by ID, adds empty row, then binds list back into data frame.
mi_df2 <- do.call(rbind,Map(rbind,split(mi_df,mi_df$ID),rep("",3)))
rownames(mi_df2) <- NULL

R - Replace values in a specific even column based on values from a odd specific column - Application to the whole dataframe

My data frame:
data <- data.frame(A = c(1,5,6,8,7), qA = c(1,2,2,3,1), B = c(2,5,6,8,4), qB = c(2,2,1,3,1))
For the case A and qA (= quality A): I want the values assigned to the quality value 1 and 3 are replaced by NA
And the same for the case B and qB
The final data has to be like this:
desired_data <- data.frame(A = c("NA",5,6,"NA","NA"), qA = c(1,2,2,3,1), B = c(2,5,"NA","NA","NA"), qB = c(2,2,1,3,1))
My question is how to perform that?
I have a big dataframe with about 90 columns, so I need code which doesn't require the column names to work properly.
To help, I have this part of code which select columns starting with "q" letter:
data[,grep("^[q]", colnames(data))]
You could just do this...
data[,seq(1,ncol(data),2)][(data[,seq(2,ncol(data),2)]==1)|
(data[,seq(2,ncol(data),2)]==3)] <- NA
data
A qA B qB
1 NA 1 2 2
2 5 2 5 2
3 6 2 NA 1
4 NA 3 NA 3
5 NA 1 NA 1
One solution is to separate in two tables and use vectorisation in base R
data <- data.frame(A = c(1,5,6,8,7), qA = c(1,2,2,3,1), B = c(2,5,6,8,4), qB = c(2,2,1,3,1))
data
#> A qA B qB
#> 1 1 1 2 2
#> 2 5 2 5 2
#> 3 6 2 6 1
#> 4 8 3 8 3
#> 5 7 1 4 1
quality <- data[,grep("^[q]", colnames(data))]
data2 <- data[,setdiff(colnames(data), names(quality))]
data2[quality == 1 | quality == 3] <- NA
data2
#> A B
#> 1 NA 2
#> 2 5 5
#> 3 6 NA
#> 4 NA NA
#> 5 NA NA

Double match in r

I have a huge data set in r with one row per individual. One of my columns shows a family identifier (note, sex==1, male, sex==2, female).
ind sex income hw family.id
1 1 10 6 fam.1
2 2 8 7 fam.1
3 2 15 8 fam.2
4 1 7 4 fam.3
5 2 9 5 fam.3
How can I do a "double matching" so I can match couples in the data set for many of the variables that I am interested? For example, let's say individual 2, female, married with individual 1, male, should receive an entry in a new column with his income (same goes for hw):
ind sex income hw family.id income.male hw.male
1 1 10 6 fam.1 10 6
2 2 8 7 fam.1 8 6
3 2 15 8 fam.2 - -
4 1 7 4 fam.3 7 7
5 2 9 5 fam.3 9 7
I've said "double matching" in the title because I don't need to match only the family.ID, but I need to find a male that matches this fam.id. The reason I am doing this is because later all males will be dropped from the data set and I will remain only with rows for females.
I am sorry I can't show any coding I've worked. I've tried many approaches using match, ifelse, lapply and even unlist but it is not worth to add it here as unfortunately I can't make it work.
Anyone has a clue? We can work with both data.frames or data.tables environments.
You should go with data.table package. Here is an example:
library(data.table)
dt <- data.table(ind = c(1, 2, 3, 4, 5), sex =c(1, 2, 2, 1, 2), income = c(10, 8, 15, 7, 9), hw = c(6, 7, 8, 4, 5), family.id = c('fam.1', 'fam.1', 'fam.2', 'fam.3', 'fam.3'))
setkeyv(dt, 'family.id')
dt2 <- dt[dt[sex == 1, list(family.id, income, hw)]]
It will take income and hw of males (dt[sex == 1, list(family.id, income, hw)]) and match all individuals on family.id. As a result you obtain:
ind sex income hw family.id i.income i.hw
1: 1 1 10 6 fam.1 10 6
2: 2 2 8 7 fam.1 10 6
3: 4 1 7 4 fam.3 7 4
4: 5 2 9 5 fam.3 7 4
columns with prefix i. containing values of males for every family. Note that if no male is present you will not receive any row. If you still need this you can do:
dt2 <- merge(dt, dt[sex == 1, list(family.id, income, hw)], by = 'family.id', suffixes = c('', '.i'), all = TRUE)
to receive
family.id ind sex income hw income.i hw.i
1: fam.1 1 1 10 6 10 6
2: fam.1 2 2 8 7 10 6
3: fam.2 3 2 15 8 NA NA
4: fam.3 4 1 7 4 7 4
5: fam.3 5 2 9 5 7 4
Later when you need to drop male data you do:
dt2[sex == 2]
Let's assume that the dataframe is named 'dat'. You can merge the males and females by family.id with the merge function. You proposed answeer didn't make sense to me or to the otehr commenters but you can reassign "income" or "hw" within this new object.
> merge( dat[ dat$sex==1, ], dat[dat$sex==2,] , by="family.id")
family.id ind.x sex.x income.x hw.x ind.y sex.y income.y hw.y
1 fam.1 1 1 10 6 2 2 8 7
2 fam.3 4 1 7 4 5 2 9 5
To follow up on my comment:
require(data.table)
dt[dt[sex == 1L], c("i.m", "hw.m") := .(i.income, i.hw), on="family.id"][]
Extract the row indices where sex == 'male' for each family.id and add two columns by reference with the corresponding income and hw values.
where dt is:
dt = fread('ind sex income hw family.id
1 1 10 6 fam.1
2 2 8 7 fam.1
3 2 15 8 fam.2
4 1 7 4 fam.3
5 2 9 5 fam.3')

R data.table with variable number of columns

For each student in a data set, a certain set of scores may have been collected. We want to calculate the mean for each student, but using only the scores in the columns that were germane to that student.
The columns required in a calculation are different for each row. I've figured how to write this in R using the usual tools, but am trying to rewrite with data.table, partly for fun, but also partly in anticipation of success in this small project which might lead to the need to make calculations for lots and lots of rows.
Here is a small working example of "choose a specific column set for each row problem."
set.seed(123234)
## Suppose these are 10 students in various grades
dat <- data.frame(id = 1:10, grade = rep(3:7, by = 2),
A = sample(c(1:5, 9), 10, replace = TRUE),
B = sample(c(1:5, 9), 10, replace = TRUE),
C = sample(c(1:5, 9), 10, replace = TRUE),
D = sample(c(1:5, 9), 10, replace = TRUE))
## 9 is a marker for missing value, there might also be
## NAs in real data, and those are supposed to be regarded
## differently in some exercises
## Students in various grades are administered different
## tests. A data structure gives the grade to test linkage.
## The letters are column names in dat
lookup <- list("3" = c("A", "B"),
"4" = c("A", "C"),
"5" = c("B", "C", "D"),
"6" = c("A", "B", "C", "D"),
"7" = c("C", "D"),
"8" = c("C"))
## wrapper around that lookup because I kept getting confused
getLookup <- function(grade){
lookup[[as.character(grade)]]
}
## Function that receives one row (named vector)
## from data frame and chooses columns and makes calculation
getMean <- function(arow, lookup){
scores <- arow[getLookup(arow["grade"])]
mean(scores[scores != 9], na.rm = TRUE)
}
stuscores <- apply(dat, 1, function(x) getMean(x, lookup))
result <- data.frame(dat, stuscores)
result
## If the data is 1000s of thousands of rows,
## I will wish I could use data.table to do that.
## Client will want students sorted by state, district, classroom,
## etc.
## However, am stumped on how to specify the adjustable
## column-name chooser
library(data.table)
DT <- data.table(dat)
## How to write call to getMean correctly?
## Want to do this for each participant (no grouping)
setkey(DT, id)
The desired output is the student average for the appropriate columns, like so:
> result
id grade A B C D stuscores
1 1 3 9 9 1 4 NaN
2 2 4 5 4 1 5 3.0
3 3 5 1 3 5 9 4.0
4 4 6 5 2 4 5 4.0
5 5 7 9 1 1 3 2.0
6 6 3 3 3 4 3 3.0
7 7 4 9 2 9 2 NaN
8 8 5 3 9 2 9 2.0
9 9 6 2 3 2 5 3.0
10 10 7 3 2 4 1 2.5
Then what? I've written a lot of mistakes so far...
I did not find any examples in the data table examples in which the columns to be used in calculations for each row was itself a variable, I thank you for your advice.
I was not asking anybody to write code for me, I'm asking for advice on how to get started with this problem.
First of all, when creating a reproducible example using functions such as sample (which set a random seed each time you run it), you should use set.seed.
Second of all, instead of looping over each row, you could just loop over the lookup list which will always be smaller than the data (many times significantly smaller) and combine it with rowMeans. You can also do it with base R, but you asked for a data.table solution so here goes (for the purposes of this solution I've converted all 9 to NAs, but you can try to generalize this to your specific case too)
So using set.seed(123), your function gives
apply(dat, 1, function(x) getMean(x, lookup))
# [1] 2.000000 5.000000 4.666667 4.500000 2.500000 1.000000 4.000000 2.333333 2.500000 1.500000
And here's a possible data.table application which runs only over the lookup list (for loops on lists are very efficient in R btw, see here)
## convert all 9 values to NAs
is.na(dat) <- dat == 9L
## convert your original data to `data.table`,
## there is no need in additional copy of the data if the data is huge
setDT(dat)
## loop only over the list
for(i in names(lookup)) {
dat[grade == i, res := rowMeans(as.matrix(.SD[, lookup[[i]], with = FALSE]), na.rm = TRUE)]
}
dat
# id grade A B C D res
# 1: 1 3 2 NA NA NA 2.000000
# 2: 2 4 5 3 5 NA 5.000000
# 3: 3 5 3 5 4 5 4.666667
# 4: 4 6 NA 4 NA 5 4.500000
# 5: 5 7 NA 1 4 1 2.500000
# 6: 6 3 1 NA 5 3 1.000000
# 7: 7 4 4 2 4 5 4.000000
# 8: 8 5 NA 1 4 2 2.333333
# 9: NA 6 4 2 2 2 2.500000
# 10: 10 7 3 NA 1 2 1.500000
Possibly, this could be improved utilizing set, but I can't think of a good way currently.
P.S.
As suggested by #Arun, please take a look at the vignettes he himself wrote here in order to get familiar with the := operator, .SD, with = FALSE, etc.
Here's another data.table approach using melt.data.table (needs data.table 1.9.5+) and then joins between data.tables:
DT_m <- setkey(melt.data.table(DT, c("id", "grade"), value.name = "score"), grade, variable)
lookup_dt <- data.table(grade = rep(as.integer(names(lookup)), lengths(lookup)),
variable = unlist(lookup), key = "grade,variable")
score_summary <- setkey(DT_m[lookup_dt, nomatch = 0L,
.(res = mean(score[score != 9], na.rm = TRUE)), by = id], id)
setkey(DT, id)[score_summary, res := res]
# id grade A B C D mean_score
# 1: 1 3 9 9 1 4 NaN
# 2: 2 4 5 4 1 5 3.0
# 3: 3 5 1 3 5 9 4.0
# 4: 4 6 5 2 4 5 4.0
# 5: 5 7 9 1 1 3 2.0
# 6: 6 3 3 3 4 3 3.0
# 7: 7 4 9 2 9 2 NaN
# 8: 8 5 3 9 2 9 2.0
# 9: 9 6 2 3 2 5 3.0
#10: 10 7 3 2 4 1 2.5
It's more verbose, but just over twice as fast:
microbenchmark(da_method(), nk_method(), times = 1000)
#Unit: milliseconds
# expr min lq mean median uq max neval
# da_method() 17.465893 17.845689 19.249615 18.079206 18.337346 181.76369 1000
# nk_method() 7.047405 7.282276 7.757005 7.489351 7.667614 20.30658 1000

Read csv with two headers into a data.frame

Apologies for the seemingly simple question, but I can't seem to find a solution to the following re-arrangement problem.
I'm used to using read.csv to read in files with a header row, but I have an excel spreadsheet with two 'header' rows - cell identifier (a, b, c ... g) and three sets of measurements (x, y and z; 1000s each) for each cell:
a b
x y z x y z
10 1 5 22 1 6
12 2 6 21 3 5
12 2 7 11 3 7
13 1 4 33 2 8
12 2 5 44 1 9
csv file below:
a,,,b,,
x,y,z,x,y,z
10,1,5,22,1,6
12,2,6,21,3,5
12,2,7,11,3,7
13,1,4,33,2,8
12,2,5,44,1,9
How can I get to a data.frame in R as shown below?
cell x y z
a 10 1 5
a 12 2 6
a 12 2 7
a 13 1 4
a 12 2 5
b 22 1 6
b 21 3 5
b 11 3 7
b 33 2 8
b 44 1 9
Use base R reshape():
temp = read.delim(text="a,,,b,,
x,y,z,x,y,z
10,1,5,22,1,6
12,2,6,21,3,5
12,2,7,11,3,7
13,1,4,33,2,8
12,2,5,44,1,9", header=TRUE, skip=1, sep=",")
names(temp)[1:3] = paste0(names(temp[1:3]), ".0")
OUT = reshape(temp, direction="long", ids=rownames(temp), varying=1:ncol(temp))
OUT
# time x y z id
# 1.0 0 10 1 5 1
# 2.0 0 12 2 6 2
# 3.0 0 12 2 7 3
# 4.0 0 13 1 4 4
# 5.0 0 12 2 5 5
# 1.1 1 22 1 6 1
# 2.1 1 21 3 5 2
# 3.1 1 11 3 7 3
# 4.1 1 33 2 8 4
# 5.1 1 44 1 9 5
Basically, you should just skip the first row, where there are the letters a-g every third column. Since the sub-column names are all the same, R will automatically append a grouping number after all of the columns after the third column; so we need to add a grouping number to the first three columns.
You can either then create an "id" variable, or, as I've done here, just use the row names for the IDs.
You can change the "time" variable to your "cell" variable as follows:
# Change the following to the number of levels you actually have
OUT$cell = factor(OUT$time, labels=letters[1:2])
Then, drop the "time" column:
OUT$time = NULL
Update
To answer a question in the comments below, if the first label was something other than a letter, this should still pose no problem. The sequence I would take would be as follows:
temp = read.csv("path/to/file.csv", skip=1, stringsAsFactors = FALSE)
GROUPS = read.csv("path/to/file.csv", header=FALSE,
nrows=1, stringsAsFactors = FALSE)
GROUPS = GROUPS[!is.na(GROUPS)]
names(temp)[1:3] = paste0(names(temp[1:3]), ".0")
OUT = reshape(temp, direction="long", ids=rownames(temp), varying=1:ncol(temp))
OUT$cell = factor(temp$time, labels=GROUPS)
OUT$time = NULL

Resources