Create a new column based on values from other variables - r

I have data that looks like this:
A set of 10 character variables
Char<-c("A","B","C","D","E","F","G","H","I","J")
And a data frame that looks like this
Col1<-seq(1:25)
Col2<-c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4,5,5,5,5,5)
DF<-data.frame(Col1,Col2)
What I would like to do is add a third column to the data frame, with the logic that 1=A, 2=B, 3=C and so on. So the end result would be
Col3<-c("A","A","A","A","A","B","B","B","B","B","C","C","C","C","C","D","D","D","D","D","E","E","E","E","E")
DF<-data.frame(Col1,Col2,Col3)
For this simple example I could go with a simple substitution like this question:
Create new column based on 4 values in another column
But my actual data set is much bigger with a lot more variables than this simple example, so writing out the equivalents as in the above answer is not a possibility.
So I would like to have a bit of code that can be applied to a much larger data frame. Perhaps something that looped through all the values of Col2 and matched them to the location of Char.
1=Char[1] 2=Char[2] 3=Char[3]...... for the entire length of Col2
Or any other way that could scale up to a long monstrous data frame

# Values that Col2 might have taken
levels = c(1, 2, 3, 4, 5)
# Labels for the levels in same order as levels
labels = c('A', 'B', 'C', 'D', 'E')
DF$Col3 <- factor(DF$Col2, levels = levels, labels = labels)

I know it may be taboo to use for loops in R, but I tried this out and it worked well.
# Pre-allocate the new column, then fill it row by row
DF$Col3 <- NA_character_
for (i in seq_along(DF$Col2)) {
  DF$Col3[i] <- Char[DF$Col2[i]]
}
Would that be sufficient? I think you could also use unique(DF$Col2) or levels(factor(DF$Col2)) to get the set of values to map.
Perhaps though I'm misunderstanding your question.
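Since Char is already ordered so that position 1 is "A", position 2 is "B", and so on, the same result can probably be had without a loop by indexing Char with Col2 directly. A minimal sketch, rebuilding the question's Char and DF so that it runs on its own:

```r
Char <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
DF <- data.frame(Col1 = 1:25, Col2 = rep(1:5, each = 5))

# Index Char by the integer codes in Col2: one vectorised lookup, no loop
DF$Col3 <- Char[DF$Col2]

head(DF$Col3, 6)
# "A" "A" "A" "A" "A" "B"
```

This relies on the values of Col2 being valid positions in Char; a value outside 1..length(Char) would give NA.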

If you wanted to use each column as an index into some vector (I'll use letters so I can index up to 25), returning a data frame of the same dimension of DF, you could use:
transformed <- as.data.frame(lapply(DF, function(x) letters[x]))
head(transformed)
# Col1 Col2
# 1 a a
# 2 b a
# 3 c a
# 4 d a
# 5 e a
# 6 f b
You could then combine this with your original data frame with cbind(DF, transformed).

Why not make a key and join?
library(dplyr)
# data_frame() is deprecated; tibble() is the current equivalent
letter_key = tibble(letter__ID = 1:26,
                    letter = letters)
DF %>%
  rename(letter__ID = Col2) %>%
  left_join(letter_key, by = "letter__ID")
This kind of thing can also be done with factors
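The key idea also works in base R, and it does not require the codes to be consecutive positions. A small sketch, where the key data frame and its column names are made up for illustration:

```r
DF <- data.frame(Col1 = 1:25, Col2 = rep(1:5, each = 5))

# Hypothetical lookup table: one row per code, with the label it maps to
key <- data.frame(Col2 = 1:5, Col3 = c("A", "B", "C", "D", "E"))

# match() returns, for each row of DF, the row of key holding the same code
DF$Col3 <- key$Col3[match(DF$Col2, key$Col2)]

table(DF$Col3)
```

Codes that are missing from key come back as NA, which makes unmapped values easy to spot.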

Related

Dividing one dataframe into many with names in R

I have some large data frames that are big enough to push the limits of R on my machine; e.g., the one on which I'm currently working is 2 columns by 70 million rows. The contents aren't important, but just in case, column 1 is a string and column 2 is an integer.
What I would like to do is split that data frame into n parts (say, 20, but preferably something that could change on a case-by-case basis) so that I can work on each of the smaller data frames one at a time. That means that (a) the result has to produce things that are named (e.g., "newdf_1", "newdf_2", ... "newdf_20" or something), and (b) each line in the original data frame needs to be in one (and only one) of the new "sub" data frames. The order does not matter, but doing it sequentially by rows makes sense to me.
Once I do the work, I will start to recombine them (using rbind()) one pair at a time.
I've looked at split(), but from what I can tell, it is designed to work with factors (which I don't have).
Any ideas?
You can create a new column and split the data frame based on that column. The column does not need to be a factor, but it needs to be of a data type that the split function can convert to a factor.
# Number of groups
N <- 20
dat$group <- 1:nrow(dat) %% N
# Add 1 to group
dat$group <- dat$group + 1
# Split the dat by group
dat_list <- split(dat, f = ~group)
# Set the name of the list
names(dat_list) <- paste0("newdf_", 1:N)
Data
set.seed(123)
# Create example data frame
dat <- data.frame(
A = sample(letters, size = 70000000, replace = TRUE),
B = rpois(70000000, lambda = 1)
)
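The question mentions that splitting sequentially by rows would be natural. A small sketch of that variant, run on a toy data frame rather than 70 million rows; n_parts, group, parts and recombined are names chosen here for illustration:

```r
# Toy stand-in for the large data frame
dat <- data.frame(A = letters[1:10], B = 1:10)

n_parts <- 3  # hypothetical number of chunks

# Consecutive rows share a group id; ids run from 1 to n_parts
group <- ceiling(seq_len(nrow(dat)) * n_parts / nrow(dat))

parts <- split(dat, group)
names(parts) <- paste0("newdf_", seq_along(parts))

# Every row lands in exactly one part
sum(sapply(parts, nrow)) == nrow(dat)

# Recombine once the per-part work is done
recombined <- do.call(rbind, unname(parts))
```

Keeping the pieces in one named list (parts$newdf_1, parts$newdf_2, ...) tends to be easier to loop over than creating 20 separate objects in the workspace.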
Here's a tidyverse based solution. Try using read_csv_chunked().
library(tidyverse)
# practice data
tibble(string = sample(letters, 1e6, replace = TRUE),
       value = rnorm(1e6)) %>%
  write_csv("test.csv")
# here's the solution: read the file in chunks, keeping only the matching rows of each chunk
partial_data <- read_csv_chunked("test.csv",
  DataFrameCallback$new(function(x, pos) filter(x, string == "a")),
  chunk_size = 1000)
You can wrap the call to read_csv_chunked in a function where you change the string that you subset on.
This is more or less a repeat of this question:
How to read only lines that fulfil a condition from a csv into R?

R replace identical column character items with increasing number

I have a data frame with 60000 obs. of 4 variables in the following format:
I need to replace each distinct character item in the first column with an increasing number, so that identical items get the same number. So "101-startups" is 1, "10i10-aps" is 2, "10x" is 3, all "10x-fund-lp" are 4, and so on. The same for the second column.
How do I achieve this?
If I'm understanding your question correctly, all you need to do is something like:
my_data$col_1 <- as.integer(factor(my_data$col1, levels = unique(my_data$col1)))
my_data$col_2 <- as.integer(factor(my_data$col2, levels = unique(my_data$col2)))
Probably a good idea to read up on factors
Try building a separate dataframe from the unique entries of that column, then use the row names (which will be consecutive integers). If your dataframe is df and that first column is v1, something like
x = data.frame(v1 = unique(df$v1))
x$numbers = as.integer(row.names(x))
Then you can do some kind of merge
final.df = merge(x, df, by = "v1")
and then using something like dplyr to select/drop/rearrange columns
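Both answers boil down to numbering the distinct values in order of first appearance. A compact sketch of that idea using base R's match(); the vector v is a made-up stand-in for the first column, built from the values named in the question:

```r
v <- c("101-startups", "10i10-aps", "10x",
       "10x-fund-lp", "10x-fund-lp", "10x")

# Position of each value within the vector of distinct values
ids <- match(v, unique(v))
ids
# 1 2 3 4 4 3

# Same result as the factor-based answer
identical(ids, as.integer(factor(v, levels = unique(v))))
# TRUE
```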

Unused factor levels not dropped after subsetting?

I have a variable df1$StudyAreaVisitNote which I turn into a factor. But when I subsetted the df1 into BS this variable did not remain a factor: using the table( ) function on the subsetted data would show results that seemed to be what should be returned if table() was run on the original data?
Why does this happen?
The two workarounds I found were:
export the subsetted data and re-import
after subsetting, designate the column as a factor again
Code:
# My dataset can be found here: http://textuploader.com/9tx5 (I'm sure there's a better way to host it, but I'm new, sorry!)
# Load Initial Dataset (df1)
df1 <- read.csv("/Users/user/Desktop/untitled folder/pre_subset.csv", header=TRUE,sep=",")
# Make both columns factors
df1$Trap.Type <- factor(df1$Trap.Type)
df1$StudyAreaVisitNote <-factor(df1$StudyAreaVisitNote)
# Subset out site of interest
BS <- subset(df1, Trap.Type=="HR-BA-BS")
# Export to Excel, save as CSV after it's in excel
library(WriteXLS)
WriteXLS("BS", ExcelFileName = "/Users/user/Desktop/test.xlsx", col.names = TRUE, AdjWidth = TRUE, BoldHeaderRow = TRUE, FreezeRow = 1)
# Load second Dataset (df2)
df2 <- read.csv("/Users/user/Desktop/untitled folder/post_subset.csv", header=TRUE, sep=",")
# both datasets should be identical, and they are superficially, but...
# Have a look at df2
summary(df2$StudyAreaVisitNote) # Looks good, only counts levels that are present
# Now, look at BS from df1
summary(BS$StudyAreaVisitNote) # sessions not present in the subsetted data (but present in df1?) are included???
# Make BS$StudyAreaVisitNote a factor...Again??
BS$StudyAreaVisitNote <- factor(BS$StudyAreaVisitNote)
# Try the summary of BS again
summary(BS$StudyAreaVisitNote) # this time it works, why is factor not carried through during subset?
A factor is maintained a factor even after subsetting. I'm sure class(BS$StudyAreaVisitNote)=="factor". But, factors don't automatically drop their unused levels. This can be helpful when you are doing stuff like
set.seed(16)
dd<-data.frame(
  gender=sample(c("M","F"), 25, replace=TRUE),
  age=rpois(25, 20),
  stringsAsFactors=TRUE # needed since R 4.0 for gender to be a factor
)
dd
table(subset(dd, age<15)$gender)
# F M
# 0 3
Here the factor remembers that it had both M's and F's, and even if the subset doesn't have any F's, the levels are still retained. You may explicitly call droplevels() if you want to get rid of unused levels.
table(droplevels(subset(dd, age<15))$gender)
# M
# 3
(now it forgot about the F's)
So instead of summary, compare the results of table on your two data.frames.
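To see the same behaviour without depending on a random seed, here is a small deterministic sketch; the data frame and the name young are made up for illustration:

```r
dd <- data.frame(
  gender = factor(c("M", "F", "M", "M", "F")),
  age    = c(12, 30, 14, 25, 41)
)

young <- subset(dd, age < 15)   # keeps rows 1 and 3, both "M"

class(young$gender)             # still "factor"
levels(young$gender)            # "F" "M": the unused level is retained
table(young$gender)             # F = 0, M = 2

young <- droplevels(young)
levels(young$gender)            # "M"
```

Calling factor() on the column again, as in the question's second workaround, drops the unused levels in the same way for a single column.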

Recoding sequentially-named variables based on values of answers

I'm struggling with using lapply to recode values parsimoniously.
Let's say I have 10 survey questions with 4 answers each, in which there is always one right or wrong answer. The questions are labeled q_1 through q_10, and my dataframe is called df. I'd like to create new variables with the same sequential labels that simply code the question as "right" (1) or "wrong" (0).
If I were to make a list of the right answers, it would be:
right_answers<-c(1,2,3,4,2,3,4,1,2,4)
Then, I'm trying to write a function that simply recodes all of the variables into new variables while using the same sequential identifier, such as
lapply(1:10, function(fx) {
df$know_[fx]<-ifelse(df$q_[fx]==right_answers[fx],1,0)
})
In a hypothetical universe where this code was remotely correct, I'd get results such that:
id q_1 know_1 q_2 know_2
1 1 1 2 1
2 4 0 3 0
3 3 0 2 1
4 4 0 1 0
Thanks so much for your help!
For the same matrix output as the other answers, I would suggest:
q_names <- paste0("q_", seq_along(right_answers))
answers <- df[q_names]
correct <- mapply(`==`, answers, right_answers)
This should give you a matrix of whether or not each answer was correct:
t(apply(df[, grep("q_", names(df))], 1, function(X) X == right_answers))
You are likely having trouble with this part of the code: df$q_[fx]. You could build the column names using paste, such as:
df = read.table(text = "
id q_1 q_2
1 1 2
2 4 3
3 3 2
4 4 1", header = TRUE)
right_answers = c(1,2,3,4,2,3,4,1,2,4)
dat2 = sapply(1:2, function(fx) {
ifelse(df[paste("q",fx,sep = "_")]==right_answers[fx],
1,0)
})
This doesn't add columns to your data.frame, but instead makes a new matrix much like @SenorO's answer. You can name the columns in the matrix and then add them to the original data.frame as follows.
colnames(dat2) = paste("know", 1:2, sep = "_")
data.frame(df, dat2)
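If the goal is to add the know_ columns straight to the data frame, something along these lines might work. It is a sketch on the two-question toy data from above, with right_answers shortened to match; q_names and know_names are names introduced here:

```r
df <- data.frame(id = 1:4, q_1 = c(1, 4, 3, 4), q_2 = c(2, 3, 2, 1))
right_answers <- c(1, 2)   # only the first two answers, to match the toy data

q_names    <- paste0("q_", seq_along(right_answers))
know_names <- paste0("know_", seq_along(right_answers))

# Compare each q_ column with its right answer;
# as.integer() turns TRUE/FALSE into 1/0
df[know_names] <- Map(function(col, key) as.integer(col == key),
                      df[q_names], right_answers)
df
```

Map() walks over the question columns and the answer key in parallel, so the pairing depends on q_names and right_answers being in the same order.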
I'd like to suggest a different approach to your question, using the reshape2 package. In my opinion, this has the advantages of being: 1) more idiomatic R (for what that's worth), 2) more readable code, 3) less error prone, particularly if you want to add analysis in the future. In this approach, everything is done within dataframes, which I think is desirable when possible -- easier to keep all the values for a single record (id in this case) and easier to use the power of R tools.
# Creating a dataframe with the form you describe
df <- data.frame(id=c('1','2','3','4'), q_1 = c(1,4,3,4), q_2 = c(2,3,2,1), q_3 = rep(1, 4), q_4 = rep(2, 4), q_5 = rep(3, 4),
q_6 = rep(4,4), q_7 = c(1,4,3,4), q_8 = c(2,3,2,1), q_9 = rep(1, 4), q_10 = rep(2, 4))
right_answers<-c(1,2,3,4,2,3,4,1,2,4)
# Associating the right answers explicitly with the corresponding question labels in a data frame
answer_df <- data.frame(questions=paste('q', 1:10, sep='_'), right_answers)
library(reshape2)
# "Melting" the dataframe from "wide" to "long" form -- now questions labels are in variable values rather than in column names
melt_df <- melt(df) # melt function is from reshape2 package
# Now merging the correct answers into the data frame containing the observed answers
merge_df <- merge(melt_df, answer_df, by.x='variable', by.y='questions')
# At this point comparing the observed to correct answers is trivial (using as.numeric to convert from logical to 0/1 as you request, though keeping as TRUE/FALSE may be clearer)
merge_df$correct <- as.numeric(merge_df$value==merge_df$right_answers)
# If desirable (not sure it is), put back into "wide" dataframe form
cast_obs_df <- dcast(merge_df, id ~ variable, value.var='value') # dcast function is from reshape2 package
cast_cor_df <- dcast(merge_df, id ~ variable, value.var='correct')
names(cast_cor_df) <- gsub('q_', 'know_', names(cast_cor_df))
final_df <- merge(cast_obs_df, cast_cor_df)
The new tidyr package would probably be even better here than reshape2.

Vectorized meta data computation based on multiple columns on R data.frame

I have a data.frame with 3 columns, each of which can be thought of as a factor. I'd like to compute some stats on the data.frame and store it in a new frame. To be more specific, I have the following fields:
obs, len, src
A 10 X
B 10 Y
I'd like to compute the breakdown of each source at each length (i.e. what percentage of observations from source X that are of length 10 are "A", "B", etc.)
An obvious approach to this is to use two for loops to iterate over the lengths and sources and then use nrow() and count() to get the values I'd need to compute, like so:
relevant_subset <- data[data$src==source & data$len==length,]
breakdown_info <- count(relevant_subset)
breakdown_info$frac <- breakdown_info$freq / nrow(relevant_subset)
Is there a way to avoid using the double for loop and use a more vectorized approach? Is there a smart way to pre-allocate the new frame that would hold the modified breakdown_info for each length and source?
aggregate is your friend for these tasks:
Example data:
set.seed(23)
test <- data.frame(
obs=sample(LETTERS[1:2],20,replace=TRUE),
len=sample(c(10,20),20,replace=TRUE),
src=sample(LETTERS[24:25],20,replace=TRUE)
)
Aggregate it:
aggregate(obs ~ src + len,data=test, function(x) prop.table(table(x)))
src len obs.A obs.B
1 X 10 0.6000000 0.4000000
2 Y 10 0.2000000 0.8000000
3 X 20 0.2500000 0.7500000
4 Y 20 0.1666667 0.8333333
This is what the plyr package was made for!
The format is <input_type><output_type>ply. For example if the input is a data.frame and you want the output to be a data.frame use ddply.
To use it, you specify the input data.frame, the columns to group by and then a function that constructs a data.frame from each group. The resulting data.frames appended with the grouping columns are assembled together into the output data.frame.
In something similar to your example, you could do
require(plyr)
a <- data.frame(
obs=factor(c('A','A','A','B','B')),
len=c(10,10,10,10,210),
src=factor(c('X','X','Y','Y','Z')))
then
z <- ddply(
a,
.(obs),
function(df){
data.frame(mean.len=mean(df$len))
})
would produce
data.frame(
  obs=c('A', 'B'),
  mean.len=c(10, 110))
while
ddply(a, .(src), function(df){
data.frame(
num.obs.A = sum(df$obs == 'A'),
num.obs.B = sum(df$obs == 'B'))})
would produce
data.frame(
  src=c('X','Y', 'Z'),
  num.obs.A = c(2,1,0),
  num.obs.B = c(0,1,1))
The website http://plyr.had.co.nz/ has good documentation too.
You haven't stated a reason why you want a data.frame here as output. Perhaps it's best for you, perhaps not. You also aren't really clear on what proportions are what but I think the following might solve your problem best.
prop.table( table(test) )
You could enter it slightly differently and play with the order of columns so that what you want to compare is most easily examined. But, this output is a 3-dimensional array and quite a bit different from a data.frame.
(example of alternate usage)
prop.table(with(test, table(src, obs, len) ))
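If the goal is the share of each obs within every src/len combination, as the question describes, prop.table() needs to be told which dimensions to normalise over. A sketch on a small hand-made data frame (toy is hypothetical, not the question's data):

```r
toy <- data.frame(
  obs = c("A", "A", "B", "A", "B", "B"),
  len = c(10, 10, 10, 20, 20, 20),
  src = c("X", "X", "X", "X", "Y", "Y")
)

tab <- with(toy, table(src, len, obs))

# margin = c(1, 2): proportions sum to 1 within each (src, len) combination
props <- prop.table(tab, margin = c(1, 2))

props["X", "10", ]   # A = 2/3, B = 1/3

# Long data frame with one row per src/len/obs combination
as.data.frame(props, responseName = "frac")
```

Combinations with no observations at all (here src Y at len 10) come out as NaN, because they divide zero by zero.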
