I have a data frame, and I want to transform all columns (say, take the logs or whatever) with columns that match a certain name. So in the example below, I want to take the log of X.1 and X.2, but not Y or Z.1.
df <- data.frame(
Y = sample(0:1, 10, replace = TRUE),
X.1 = sample(1:10),
X.2 = sample(1:10),
Z.1 = sample(151:160)
)
# option 1, won't work for dozens of fields
df$X.1 <- log(df$X.1)
df$X.2 <- log(df$X.2)
Is there a good, efficient way to do this when the dataframe is several gigabtyes?
In the case of functions that will return a data.frame:
cols <- c("X.1","X.2")
df[cols] <- log(df[cols])
Otherwise you will need to use lapply or a loop over the columns. These solutions will be slower than the solution above, so only use them if you must.
df[cols] <- lapply(df[cols], function(x) c(NA,diff(x)))
for(col in cols) {
df[col] <- c(NA,diff(df[col]))
}
vars <- c("X.1", "X.2")
df[vars] <- lapply(df[vars], log)
df <- data.frame(
Y = sample(0:1, 10, replace = TRUE),
X.1 = sample(1:10),
X.2 = sample(1:10),
Z.1 = sample(151:160)
)
df
assuming that you know those variables which requires conversions in the real dataframe (2 and 3 refers to the 2nd and 3rd variables in df which are X.1 and X.2)
df2=log10(df[c(2:3)])
df2
if the variables are far a part in the dataframe you can select them like c(1,3,6,8:10,13) for 1st, 3rd, 6th 8 through 10 and 13th.this works only for numerical variables.
Related
I have a question regarding combining columns based on two conditions.
I have two datasets from an experiment where participants had to type in a code, answer about their gender and eyetracking data was documented. The experiment happened twice (first: random1, second: random2).
eye <- c(1000,230,250,400)
gender <- c(1,2,1,2)
code <- c("ABC","DEF","GHI","JKL")
random1 <- data.frame(code,gender,eye)
eye2 <- c(100,250,230,450)
gender2 <- c(1,1,2,2)
code2 <- c("ABC","DEF","JKL","XYZ")
random2 <- data.frame(code2,gender2,eye2)
Now I want to combine the two dataframes. For all rows where code and gender match, the rows should be combined (so columns added). Code and gender variables of those two rows should become one each (gender3 and code3) and the eyetracking data should be split up into eye_first for random1 and eye_second for random2.
For all rows where there was not found a perfect match for their code and gender values, a new dataset with all of these rows should exist.
#this is what the combined data looks like
gender3 <- c(1,2)
eye_first <- c(1000,400)
eye_second <- c(100, 230)
code3 <- c("ABC", "JKL")
random3 <- data.frame(code3,gender3,eye_first,eye_second)
#this is what the data without match should look like
gender4 <- c(2,1,2)
eye4 <- c(230,250,450)
code4 <- c("DEF","GHI","XYZ")
random4 <- data.frame(code4,gender4,eye4)
I would greatly appreciate your help! Thanks in advance.
Use the same column names for your 2 data.frames and use merge
random1 <- data.frame(code = code, gender = gender, eye = eye)
random2 <- data.frame(code = code2, gender = gender2, eye = eye2)
df <- merge(random1, random2, by = c("code", "gender"), suffixes = c("_first", "_second"))
For your second request, you can use anti_join from dplyr
df2 <- merge(random1, random2, by = c("code", "gender"), suffixes = c("_first", "_second"), all = TRUE) # all = TRUE : keep rows with ids that are only in one of the 2 data.frame
library(dplyr)
anti_join(df2, df, by = c("code", "gender"))
So I have three data sets that I need to merge. These contain school data and read/math scores for grades 4 and 5. One of them is a long form data set that has a lot of missingness in some variables (yes, I do need the data in long form) and the other two have the full missing data in wide form. All of these data frames contain a column that has an unique ID number for each individual in the database.
Here is a full reproducible example that generates a small example of the types of data.frames I am working with... The three data frames that I need to use are the following: school_lf, school4 and school5. school_lf has the long form data with NAs and school4 and school5 are the dfs I need to use to populate the NA's in this long form data (by id and grade)
set.seed(890)
school <- NULL
school$id <-sample(102938:999999, 100)
school$selected <-sample(0:1, 100, replace = T)
school$math4 <- sample(400:500, 100)
school$math5 <- sample(400:500, 100)
school$read4 <- sample(400:500, 100)
school$read5 <- sample(400:500, 100)
school <- as.data.frame(school)
# Delete observations at random from the school df
indm4 <- which(school$math4 %in% sample(school$math4, 25))
school$math4[indm4] <- NA
indm5 <- which(school$math5 %in% sample(school$math5, 50))
school$math5[indm5] <- NA
indr4 <- which(school$read4 %in% sample(school$read4, 70))
school$read4[indr4] <- NA
indr5 <- which(school$read5 %in% sample(school$read5, 81))
school$read5[indr5] <- NA
# Separate Read and Math
read <- as.data.frame(subset(school, select = -c(math4, math5)))
math <- as.data.frame(subset(school, select = -c(read4, read5)))
# Now turn this into long form data...
clr <- melt(read, id.vars = c("id", "selected"), variable.name = "variable", value.name = "readscore")
clm <- melt(math, id.vars = c("id", "selected"), value.name = "mathscore")
# Clean up the grades for each of these...
clr$grade <- ifelse(clr$variable == "read4", 4,
ifelse(clr$variable == "read5", 5, NA))
clm$grade <- ifelse(clm$variable == "math4", 4,
ifelse(clm$variable == "math5", 5, NA))
# Put all these in one df
school_lf <-cbind(clm, clr$readscore)
school_lf$readscore <- school_lf$`clr$readscore` # renames
school_lf$`clr$readscore` <- NULL # deletes
school_lf$variable <- NULL # deletes
###############
# Generate the 2 data frames with IDs that have the full data
set.seed(890)
school4 <- NULL
school4$id <-sample(102938:999999, 100)
school4$selected <-sample(0:1, 100, replace = T)
school4$math4 <- sample(400:500, 100)
school4$read4 <- sample(400:500, 100)
school4$grade <- 4
school4 <- as.data.frame(school4)
set.seed(890)
school5 <- NULL
school5$id <-sample(102938:999999, 100)
school5$selected <-sample(0:1, 100, replace = T)
school5$math5 <- sample(400:500, 100)
school5$read5 <- sample(400:500, 100)
school5$grade <- 5
school5 <- as.data.frame(school5)
I need to merge the wide-form data into the long-form data to replace the NAs with the actual values. I have tried the code below, but it introduces several columns instead of merging the read scores and the math scores where there's NA's. I simply need one column with the read scores and one with the math scores, instead of six separate columns (read.x, read.y, math.x, math.y, mathscore and readscore).
sch <- merge(school_lf, school4, by = c("id", "grade", "selected"), all = T)
sch <- merge(sch, school5, by = c("id", "grade", "selected"), all = T)
Any help is highly appreciated! I've been trying to solve this for hours now and haven't made any progress (so figured I'd ask here)
You can use the coalesce function from dplyr. If a value in the first vector is NA, it will see if the value at the same position in the second vector is not NA and select it. If again NA, it goes to the third.
library(dplyr)
sch %>% mutate(mathscore = coalesce(mathscore, math4, math5)) %>%
mutate(readscore = coalesce(readscore, read4, read5)) %>%
select(id:readscore)
EDIT: I just tried to do this approach on my actual data and it does not work because the replacement data also has some NAs and, as a result, the dfs I try to do coalesce with have differing number of rows... Back to square one.
I was able to figure this out with the following code (albeit it's not the most elegant or straight-forward ,and #Edwin's response helped point me in the right direction. Any suggestions on how to make this code more elegant and efficient are more than welcome!
# Idea: put both in long form and stack on top of one another... then merge like that!
sch4r <- as.data.frame(subset(school4, select = -c(mathscore)))
sch4m <- as.data.frame(subset(school4, select = -c(readscore)))
sch5r <- as.data.frame(subset(school5, select = -c(mathscore)))
sch5m <- as.data.frame(subset(school5, select = -c(readscore)))
# Put these in LF
sch4r_lf <- melt(sch4r, id.vars = c("id", "selected", "grade"), value.name = "readscore")
sch4m_lf <- melt(sch4m, id.vars = c("id", "selected", "grade"), value.name = "mathscore")
sch5r_lf <- melt(sch5r, id.vars = c("id", "selected", "grade"), value.name = "readscore")
sch5m_lf <- melt(sch5m, id.vars = c("id", "selected", "grade"), value.name = "mathscore")
# Combine in one DF
sch_full_4 <-cbind(sch4r_lf, sch4m_lf$mathscore)
sch_full_4$mathscore <- sch_full_4$`sch4m_lf$mathscore`
sch_full_4$`sch4m_lf$mathscore` <- NULL # deletes
sch_full_4$variable <- NULL
sch_full_5 <- cbind(sch5r_lf, sch5m$mathscore)
sch_full_5$mathscore <- sch_full_5$`sch5m$mathscore`
sch_full_5$`sch5m$mathscore` <- NULL
sch_full_5$variable <- NULL
# Stack together
sch_full <- rbind(sch_full_4,sch_full_5)
sch_full$selected <- NULL # delete this column...
# MERGE together
final_school_math <- mutate(school_lf, mathscore = coalesce(school_lf$mathscore, sch_full$mathscore))
final_school_read <- mutate(school_lf, readscore = coalesce(school_lf$readscore, sch_full$readscore))
final_df <- cbind(final_school_math, final_school_read$readscore)
final_df$readscore <- final_df$`final_school_read$readscore`
final_df$`final_school_read$readscore` <- NULL
I have a list of dataframes and I need to transform a certain variable in each of the dataframes as factor.
E.g.
myList <- list(df1 = data.frame(A = sample(10), B = rep(1:2, 10)),
df2 = data.frame(A = sample(10), B = rep(1:2, 10))
)
Lets say that variable B needs to be factor in each dataframe. I've tried this:
TMP <- setNames(lapply(seq_along(myList), function(x) apply(myList[[x]][c("B")], 2, factor)), names(myList))
but it only returns the transformed variable, not the whole dataframe as I need. I know how to do this with for loop, but I don't want to resort to that.
Per comment from David Arenburg, this solution should work:
TMP <- lapply(myList, function(x) {x[, "B"] <- factor(x[, "B"]) ; x}) ; str(TMP)
I'm trying to aggregate a data frame using the function weighted.mean and continue to get an error. My data looks like this:
dat <- data.frame(date, nWords, v1, v2, v3, v4 ...)
I tried something like:
aggregate(dat, by = list(dat$date), weighted.mean, w = dat$nWords)
but got
Error in weighted.mean.default(X[[1L]], ...) :
'x' and 'w' must have the same length
There is another thread which answers this question using plyr but for only one variable, I want to aggregate all my variables that way.
You can do it with data.table:
library(data.table)
#set up your data
dat <- data.frame(date = c("2012-01-01","2012-01-01","2012-01-01","2013-01-01",
"2013-01-01","2013-01-01","2014-01-01","2014-01-01","2014-01-01"),
nwords = 1:9, v1 = rnorm(9), v2 = rnorm(9), v3 = rnorm(9))
#make it into a data.table
dat = data.table(dat, key = "date")
# grab the column names we want, generalized for V1:Vwhatever
c = colnames(dat)[-c(1,2)]
#get the weighted mean by date for each column
for(n in c){
dat[,
n := weighted.mean(get(n), nwords),
with = FALSE,
by = date]
}
#keep only the unique dates and weighted means
wms = unique(dat[,nwords:=NULL])
Try using by:
# your numeric data
x <- 111:120
# the weights
ww <- 10:1
mat <- cbind(x, ww)
# the group variable (in your case is 'date')
y <- c(rep("A", 7), rep("B", 3))
by(data=mat, y, weighted.mean)
If you want the results in a data frame, I suggest the plyr package:
plyr::ddply(data.frame(mat), "y", weighted.mean)
I have the following dummy dataset of 1000 observations:
obs <- 1000
df <- data.frame(
a=c(1,0,0,0,0,1,0,0,0,0),
b=c(0,1,0,0,0,0,1,0,0,0),
c=c(0,0,1,0,0,0,0,1,0,0),
d=c(0,0,0,1,0,0,0,0,1,0),
e=c(0,0,0,0,1,0,0,0,0,1),
f=c(10,2,4,5,2,2,1,2,1,4),
g=sample(c("yes", "no"), obs, replace = TRUE),
h=sample(letters[1:15], obs, replace = TRUE),
i=sample(c("VF","FD", "VD"), obs, replace = TRUE),
j=sample(1:10, obs, replace = TRUE)
)
One key feature of this dataset is that the variables a to e's values are only one 1 and the rest are 0. We are sure the only one of these five columns have a 1 as value.
I found a way to extract these rows given a condition (with a 1) and assign to their respective variables:
df.a <- df[df[,"a"] == 1,,drop=FALSE]
df.b <- df[df[,"b"] == 1,,drop=FALSE]
df.c <- df[df[,"c"] == 1,,drop=FALSE]
df.d <- df[df[,"d"] == 1,,drop=FALSE]
df.e <- df[df[,"e"] == 1,,drop=FALSE]
My dilemma now is to limit the rows saved into df.a to df.e and to merge them afterwards.
Here's a shorter way to create df.merged:
# variables of 'df'
vars <- c("a", "b", "c", "d", "e")
# number of rows to extract
n <- 100
df.merged <- do.call(rbind, lapply(vars, function(x) {
head(df[as.logical(df[[x]]), ], n)
}))
Here, rbind is sufficient. The function rbind.fill is necessary if your data frames differ with respect to the number of columns.
To get the n-rows subset, a simple data[1:n,] does the job.
df.a.sub <- df.a[1:10,]
df.b.sub <- df.b[1:10,]
df.c.sub <- df.c[1:10,]
df.d.sub <- df.d[1:10,]
df.e.sub <- df.e[1:10,]
Finally, merge them by (it took the most time to find a straightforward "merge multiple dataframes" and all I needed to do was rbind.fill(df1, df2, ..., dfn) thanks to this question and answer):
require(plyr)
df.merged <- rbind.fill(df.a.sub, df.b.sub, df.c.sub, df.d.sub, df.e.sub)