I'm struggling with using lapply to recode values parsimoniously.
Let's say I have 10 survey questions with 4 possible answers each, where exactly one answer is correct. The questions are labeled q_1 through q_10, and my dataframe is called df. I'd like to create new variables with the same sequential labels that simply code each question as right (1) or wrong (0).
If I were to make a list of the right answers, it would be:
right_answers<-c(1,2,3,4,2,3,4,1,2,4)
Then, I'm trying to write a function that simply recodes all of the variables into new variables while using the same sequential identifier, such as
lapply(1:10, function(fx) {
  df$know_[fx] <- ifelse(df$q_[fx] == right_answers[fx], 1, 0)
})
In a hypothetical universe where this code was remotely correct, I'd get results such that:
id q_1 know_1 q_2 know_2
 1   1      1   2      1
 2   4      0   3      0
 3   3      0   2      1
 4   4      0   1      0
Thanks so much for your help!
For the same matrix output as the other answers, I would suggest:
q_names <- paste0("q_", seq_along(right_answers))
answers <- df[q_names]
correct <- mapply(`==`, answers, right_answers)
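If you then want those as 0/1 know_ columns on df, as the question asks, here is a minimal follow-up sketch (know is my own name, not part of the answer above):
# logical matrix -> 0/1 matrix, rename q_* to know_*, and bind to df
know <- correct + 0
colnames(know) <- sub("^q_", "know_", q_names)
df <- cbind(df, know)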
This should give you a matrix of whether or not each answer was correct:
t(apply(df[, grep("q_", names(df))], 1, function(X) X == right_answers))
You are likely having trouble with this part of the code: df$q_[fx]. You can instead build the column names with paste, such as:
df = read.table(text = "
id q_1 q_2
1 1 2
2 4 3
3 3 2
4 4 1", header = TRUE)
right_answers = c(1,2,3,4,2,3,4,1,2,4)
dat2 = sapply(1:2, function(fx) {
  ifelse(df[paste("q", fx, sep = "_")] == right_answers[fx], 1, 0)
})
This doesn't add columns to your data.frame, but instead makes a new matrix, much like @SenorO's answer. You can name the columns in the matrix and then add them to the original data.frame as follows.
colnames(dat2) = paste("know", 1:2, sep = "_")
data.frame(df, dat2)
I'd like to suggest a different approach to your question, using the reshape2 package. In my opinion, this has the advantages of being: 1) more idiomatic R (for what that's worth), 2) more readable code, 3) less error prone, particularly if you want to add analysis in the future. In this approach, everything is done within dataframes, which I think is desirable when possible -- easier to keep all the values for a single record (id in this case) and easier to use the power of R tools.
# Creating a dataframe with the form you describe
df <- data.frame(id = c('1','2','3','4'),
                 q_1 = c(1,4,3,4), q_2 = c(2,3,2,1), q_3 = rep(1, 4),
                 q_4 = rep(2, 4), q_5 = rep(3, 4), q_6 = rep(4, 4),
                 q_7 = c(1,4,3,4), q_8 = c(2,3,2,1), q_9 = rep(1, 4),
                 q_10 = rep(2, 4))
right_answers<-c(1,2,3,4,2,3,4,1,2,4)
# Associating the right answers explicitly with the corresponding question labels in a data frame
answer_df <- data.frame(questions=paste('q', 1:10, sep='_'), right_answers)
library(reshape2)
# "Melting" the dataframe from "wide" to "long" form -- now questions labels are in variable values rather than in column names
melt_df <- melt(df, id.vars = 'id') # melt is from the reshape2 package; naming id.vars explicitly avoids relying on its guessing
# Now merging the correct answers into the data frame containing the observed answers
merge_df <- merge(melt_df, answer_df, by.x='variable', by.y='questions')
# At this point comparing the observed to correct answers is trivial (using as.numeric to convert from logical to 0/1 as you request, though keeping as TRUE/FALSE may be clearer)
merge_df$correct <- as.numeric(merge_df$value==merge_df$right_answers)
# If desirable (not sure it is), put back into "wide" dataframe form
cast_obs_df <- dcast(merge_df, id ~ variable, value.var='value') # dcast function is from reshape2 package
cast_cor_df <- dcast(merge_df, id ~ variable, value.var='correct')
names(cast_cor_df) <- gsub('q_', 'know_', names(cast_cor_df))
final_df <- merge(cast_obs_df, cast_cor_df)
The new tidyr package would probably be even better here than reshape2.
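For instance, here is a rough sketch of the melt/merge steps with current tidyr and dplyr (assuming pivot_longer; long_df is my own name):
library(tidyr)
library(dplyr)
# pivot_longer replaces melt; left_join replaces merge
long_df <- df %>%
  pivot_longer(starts_with("q_"), names_to = "questions", values_to = "value") %>%
  left_join(answer_df, by = "questions") %>%
  mutate(correct = as.numeric(value == right_answers))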
Related
I have some large data frames that are big enough to push the limits of R on my machine; e.g., the one on which I'm currently working is 2 columns by 70 million rows. The contents aren't important, but just in case, column 1 is a string and column 2 is an integer.
What I would like to do is split that data frame into n parts (say, 20, but preferably something that could change on a case-by-case basis) so that I can work on each of the smaller data frames one at a time. That means that (a) the result has to produce things that are named (e.g., "newdf_1", "newdf_2", ... "newdf_20" or something), and (b) each line in the original data frame needs to be in one (and only one) of the new "sub" data frames. The order does not matter, but doing it sequentially by rows makes sense to me.
Once I do the work, I will start to recombine them (using rbind()) one pair at a time.
I've looked at split(), but from what I can tell, it is designed to work with factors (which I don't have).
Any ideas?
You can create a new column and split the data frame based on that column. The column does not need to be a factor, but it does need to be of a type that split() can convert to one.
# Number of groups
N <- 20
dat$group <- 1:nrow(dat) %% N
# Add 1 to group
dat$group <- dat$group + 1
# Split the dat by group
dat_list <- split(dat, f = ~group)
# Set the name of the list
names(dat_list) <- paste0("newdf_", 1:N)
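If you'd rather have sequential blocks of rows than this round-robin grouping, one sketch (block sizes may differ by one when nrow(dat) doesn't divide evenly by N):
# cut the row numbers into N consecutive runs
dat$group <- ceiling(seq_len(nrow(dat)) / (nrow(dat) / N))
dat_list <- split(dat, dat$group)
names(dat_list) <- paste0("newdf_", 1:N)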
Data
set.seed(123)
# Create example data frame
dat <- data.frame(
  A = sample(letters, size = 70000000, replace = TRUE),
  B = rpois(70000000, lambda = 1)
)
Here's a tidyverse based solution. Try using read_csv_chunked().
library(tidyverse)
# practice data
tibble(string = sample(letters, 1e6, replace = TRUE),
       value = rnorm(1e6)) %>%
  write_csv("test.csv")
# here's the solution
partial_data <- read_csv_chunked("test.csv",
                                 DataFrameCallback$new(function(x, pos) filter(x, string == "a")),
                                 chunk_size = 1000)
You can wrap the call to read_csv_chunked in a function where you change the string that you subset on.
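A sketch of such a wrapper (read_string_subset and its target argument are names I made up):
read_string_subset <- function(file, target, chunk_size = 1000) {
  # keep only the rows of each chunk whose `string` column matches target
  read_csv_chunked(file,
                   DataFrameCallback$new(function(x, pos) filter(x, string == target)),
                   chunk_size = chunk_size)
}
a_rows <- read_string_subset("test.csv", "a")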
This is more or less a repeat of this question:
How to read only lines that fulfil a condition from a csv into R?
I have a problem that seems awfully trivial, but I can't seem to find a decent way to work around it.
I have this data frame that resembles a table with regression outputs:
Variable Model 1        Model 2
A        5.56545*       35.45645343434***
B        9.334223232**  14.45465464***
C        64.33232323**  3.798877*
Both column 2 (Model 1) and column 3 (Model 2) are string variables. I would like to round the numeric part to 3 digits. My idea is to split up each column, convert the numbers to numeric, round them to 3 digits, and put everything back together. This seems rather trivial, but I can't find a nice way of doing it, especially one that handles both columns at once.
Can this be done with base R regular expressions or with packages such as stringr, and what is the smallest number of steps?
Thanks in advance
round(as.numeric(gsub("[^0-9\\.]", "", "35.45645343434***")),digits = 3)
MRE:
x <- c("35.45645343434***", "14.45465464***", "3.798877*")
df <- data.frame(Variable = c("A", "B", "C"),
                 Model1 = x,
                 Model2 = x)
As a function:
f <- function(x) {
  paste0(round(as.numeric(gsub("[^0-9\\.]", "", x)), digits = 3),
         gsub("[0-9\\.]", "", x))
}
Using the function:
cbind(df[, 1, drop=FALSE], apply(df[,-1], 2, f))
Returns:
  Variable    Model1    Model2
1        A 35.456*** 35.456***
2        B 14.455*** 14.455***
3        C    3.799*    3.799*
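One caveat worth flagging: round() drops trailing zeros (9.1 would stay "9.1", not become "9.100"). If the table should always print three decimals, here is a formatC() variant of the same idea (f2 is just my illustrative name):
f2 <- function(x) {
  # format = "f" with digits = 3 pads to exactly three decimals
  num <- formatC(as.numeric(gsub("[^0-9\\.]", "", x)), format = "f", digits = 3)
  paste0(num, gsub("[0-9\\.]", "", x))
}
f2("5.56545*")  # "5.565*"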
I have data that looks like this:
A set of 10 character variables
Char<-c("A","B","C","D","E","F","G","H","I","J")
And a data frame that looks like this
Col1<-seq(1:25)
Col2<-c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4,5,5,5,5,5)
DF<-data.frame(Col1,Col2)
What I would like to do is to add a third column to the data frame, with the logic that 1=A, 2=B, 3= C and so on. So the end result would be
Col3<-c("A","A","A","A","A","B","B","B","B","B","C","C","C","C","C","D","D","D","D","D","E","E","E","E","E")
DF<-data.frame(Col1,Col2,Col3)
For this simple example I could go with a simple substitution like this question:
Create new column based on 4 values in another column
But my actual data set is much bigger with a lot more variables than this simple example, so writing out the equivalents as in the above answer is not a possibility.
So I would like to have a bit of code that can be applied to a much larger data frame. Perhaps something that looped through all the values of Col2 and matched them to the location of Char.
1=Char[1] 2=Char[2] 3=Char[3]...... for the entire length of Col2
Or any other way that could scale up to a long monstrous data frame
# Values that Col2 might have taken
levels = c(1, 2, 3, 4, 5)
# Labels for the levels in same order as levels
labels = c('A', 'B', 'C', 'D', 'E')
DF$Col3 <- factor(DF$Col2, levels = levels, labels = labels)
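For the full 10-letter Char vector, the levels and labels can be generated instead of typed out; a one-line sketch:
DF$Col3 <- factor(DF$Col2, levels = seq_along(Char), labels = Char)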
I know it may be taboo to use for loops in R, but I tried this out and it worked well.
for (i in seq_along(DF$Col2)) {
  DF$Col3[i] <- Char[DF$Col2[i]]
}
Would that be sufficient? I think you could also use unique(DF$Col2) or levels(factor(DF$Col2)).
Perhaps though I'm misunderstanding your question.
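For what it's worth, the loop isn't strictly needed here, since R indexing is vectorized; the same lookup in one line:
DF$Col3 <- Char[DF$Col2]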
If you wanted to use each column as an index into some vector (I'll use letters so I can index up to 25), returning a data frame of the same dimension of DF, you could use:
transformed <- as.data.frame(lapply(DF, function(x) letters[x]))
head(transformed)
#   Col1 Col2
# 1    a    a
# 2    b    a
# 3    c    a
# 4    d    a
# 5    e    a
# 6    f    b
You could then combine this with your original data frame with cbind(DF, transformed).
Why not make a key and join?
library(dplyr)
letter_key = data_frame(letter__ID = 1:26,
                        letter = letters)
DF %>%
rename(letter__ID = Col2) %>%
left_join(letter_key)
This kind of thing can also be done with factors.
I have a data.frame with 3 columns, each of which can be thought of as a factor. I'd like to compute some stats on the data.frame and store it in a new frame. To be more specific, I have the following fields:
obs len src
  A  10   X
  B  10   Y
I'd like to compute the breakdown of each source at each length (i.e. what percentage of observations from source X that are of length 10 are "A", "B", etc.)
An obvious approach to this is to use two for loops to iterate over the lengths and sources and then use nrow() and count() to get the values I'd need to compute, like so:
relevant_subset <- data[data$src==source & data$len==length,]
breakdown_info <- count(relevant_subset)
breakdown_info$frac <- breakdown_info$freq / nrow(relevant_subset)
Is there a way to avoid using the double for loop and use a more vectorized approach? Is there a smart way to pre-allocate the new frame that would hold the modified breakdown_info for each length and source?
aggregate is your friend for these tasks:
Example data:
set.seed(23)
test <- data.frame(
  obs = sample(LETTERS[1:2], 20, replace = TRUE),
  len = sample(c(10, 20), 20, replace = TRUE),
  src = sample(LETTERS[24:25], 20, replace = TRUE)
)
Aggregate it:
aggregate(obs ~ src + len,data=test, function(x) prop.table(table(x)))
  src len     obs.A     obs.B
1   X  10 0.6000000 0.4000000
2   Y  10 0.2000000 0.8000000
3   X  20 0.2500000 0.7500000
4   Y  20 0.1666667 0.8333333
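One detail to be aware of: aggregate() returns obs here as a matrix column inside the data.frame. If you want ordinary flat columns, a common trick is (a sketch, with res as my name for the result):
res <- aggregate(obs ~ src + len, data = test, function(x) prop.table(table(x)))
res_flat <- do.call(data.frame, res)  # splits the obs matrix into obs.A / obs.B columns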
This is what the plyr package was made for!
The format is <input_type><output_type>ply. For example if the input is a data.frame and you want the output to be a data.frame use ddply.
To use it, you specify the input data.frame, the columns to group by and then a function that constructs a data.frame from each group. The resulting data.frames appended with the grouping columns are assembled together into the output data.frame.
In something similar to your example, you could do
require(plyr)
a <- data.frame(
  obs = factor(c('A', 'A', 'A', 'B', 'B')),
  len = c(10, 10, 10, 10, 210),
  src = factor(c('X', 'X', 'Y', 'Y', 'Z')))
then
z <- ddply(
  a,
  .(obs),
  function(df) {
    data.frame(mean.len = mean(df$len))
  })
would produce
data.frame(
  obs = c('A', 'B'),
  mean.len = c(10, 110))
while
ddply(a, .(src), function(df) {
  data.frame(
    num.obs.A = sum(df$obs == 'A'),
    num.obs.B = sum(df$obs == 'B'))})
would produce
data.frame(
  src = c('X', 'Y', 'Z'),
  num.obs.A = c(2, 1, 0),
  num.obs.B = c(0, 1, 1))
The website http://plyr.had.co.nz/ also has good documentation.
You haven't stated a reason why you want a data.frame as output here. Perhaps it's best for you, perhaps not. You also aren't really clear on which proportions you want, but I think the following might solve your problem best.
prop.table( table(test) )
You could enter it slightly differently and play with the order of columns so that what you want to compare is most easily examined. But, this output is a 3-dimensional array and quite a bit different from a data.frame.
(example of alternate usage)
prop.table(with(test, table(src, obs, len) ))
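If you decide you do want a data.frame after all, as.data.frame() flattens a table into long form with one proportion per row in a Freq column:
as.data.frame(prop.table(with(test, table(src, obs, len))))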
I have a data frame on which I calculate a run length encoding for a specific column. The values of the column, dir, are either -1, 0, or 1.
dir.rle <- rle(df$dir)
I then take the run lengths and compute segmented cumulative sums across another column in the data frame. I'm using a for loop, but I feel like there should be a way to do this more intelligently.
ndx <- 1
for (i in 1:length(dir.rle$lengths)) {
  l <- dir.rle$lengths[i] - 1
  s <- ndx
  e <- ndx + l
  df[s:e, ]$cumval <- cumsum(df[s:e, ]$val)
  ndx <- e + 1
}
The run lengths of dir define the start, s, and end, e, for each run. The above code works but it does not feel like idiomatic R code. I feel as if there should be another way to do it without the loop.
This can be broken down into a two step problem. First, if we create an indexing column based off of the rle, then we can use that to group by and run the cumsum. The group by can then be performed by any number of aggregation techniques. I'll show two options, one using data.table and the other using plyr.
library(data.table)
library(plyr)
#data.table is the same thing as a data.frame for most purposes
#Fake data
dat <- data.table(dir = sample(-1:1, 20, TRUE), value = rnorm(20))
dir.rle <- rle(dat$dir)
#Compute an indexing column to group by
dat <- transform(dat, indexer = rep(1:length(dir.rle$lengths), dir.rle$lengths))
#What does the indexer column look like?
> head(dat)
     dir      value indexer
[1,]   1  0.5045807       1
[2,]   0  0.2660617       2
[3,]   1  1.0369641       3
[4,]   1 -0.4514342       3
[5,]  -1 -0.3968631       4
[6,]  -1 -2.1517093       4
#data.table approach
dat[, cumsum(value), by = indexer]
#plyr approach
ddply(dat, "indexer", summarize, V1 = cumsum(value))
Both Spacedman & Chase make the key point that a grouping variable simplifies everything (and Chase lays out two nice ways to proceed from there).
I'll just throw in an alternative approach to forming that grouping variable. It doesn't use rle and, at least to me, feels more intuitive. Basically, at each point where diff() detects a change in value, the cumsum that will form your grouping variable is incremented by one:
df$group <- c(0, cumsum(!(diff(df$dir)==0)))
# Or, equivalently
df$group <- c(0, cumsum(as.logical(diff(df$dir))))
Add a 'group' column to the data frame. Something like:
df=data.frame(z=rnorm(100)) # dummy data
df$dir = sign(df$z) # dummy +/- 1
rl = rle(df$dir)
df$group = rep(1:length(rl$lengths),times=rl$lengths)
then use tapply to sum within groups:
tapply(df$z,df$group,sum)
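And if you want running (cumulative) sums within each group, as in the original question, rather than group totals, base R's ave() is a one-line sketch:
df$cumval <- ave(df$z, df$group, FUN = cumsum)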