R: replace identical character items in a column with an increasing number

I have a data frame with 60,000 obs. of 4 variables.
I need to replace the character items in the first column with integers, giving identical strings the same number: "101-startups" becomes 1, "10i10-aps" becomes 2, "10x" becomes 3, every "10x-fund-lp" becomes 4, and so on. The same goes for the second column.
How do I achieve this?

If I'm understanding your question correctly, all you need to do is something like:
my_data$col1 <- as.integer(factor(my_data$col1, levels = unique(my_data$col1)))
my_data$col2 <- as.integer(factor(my_data$col2, levels = unique(my_data$col2)))
It's probably a good idea to read up on factors.
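For example, on toy data resembling the values in the question (my_data, col1, and col2 are assumed names):
my_data <- data.frame(col1 = c("101-startups", "10i10-aps", "10x", "10x-fund-lp", "10x-fund-lp"),
                      col2 = c("a", "a", "b", "c", "b"))
# Identical strings get the same integer, numbered by order of first appearance
my_data$col1 <- as.integer(factor(my_data$col1, levels = unique(my_data$col1)))
my_data$col2 <- as.integer(factor(my_data$col2, levels = unique(my_data$col2)))
my_data
#   col1 col2
# 1    1    1
# 2    2    1
# 3    3    2
# 4    4    3
# 5    4    2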

Try building a separate dataframe from the unique entries of that column, then use the row names (which will be consecutive integers). If your dataframe is df and that first column is v1, something like
x = data.frame(v1 = unique(df$v1))
x$numbers = as.integer(row.names(x))  # row names are character, so convert
Then you can do some kind of merge
final.df = merge(x, df, by = "v1")
Then you can use something like dplyr to select/drop/rearrange columns, as in the sketch below.
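A hedged sketch of that cleanup step with dplyr (continuing from final.df above; the idea is to drop the character column and keep the integer codes under the old name):
library(dplyr)
final.df <- final.df %>%
  select(-v1) %>%         # drop the original character column
  rename(v1 = numbers)    # keep the integer codes under the old name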


Dividing one dataframe into many with names in R

I have some large data frames that are big enough to push the limits of R on my machine; e.g., the one on which I'm currently working is 2 columns by 70 million rows. The contents aren't important, but just in case, column 1 is a string and column 2 is an integer.
What I would like to do is split that data frame into n parts (say, 20, but preferably something that could change on a case-by-case basis) so that I can work on each of the smaller data frames one at a time. That means that (a) the result has to produce things that are named (e.g., "newdf_1", "newdf_2", ... "newdf_20" or something), and (b) each line in the original data frame needs to be in one (and only one) of the new "sub" data frames. The order does not matter, but doing it sequentially by rows makes sense to me.
Once I do the work, I will start to recombine them (using rbind()) one pair at a time.
I've looked at split(), but from what I can tell, it is designed to work with factors (which I don't have).
Any ideas?
You can create a new column and split the data frame based on that column. The column does not need to be a factor, but it needs to be a data type that split() can coerce to a factor.
# Number of groups
N <- 20
dat$group <- 1:nrow(dat) %% N
# Add 1 to group
dat$group <- dat$group + 1
# Split the dat by group
dat_list <- split(dat, f = ~group)
# Name the list elements
names(dat_list) <- paste0("newdf_", 1:N)
Data
set.seed(123)
# Create example data frame
dat <- data.frame(
  A = sample(letters, size = 70000000, replace = TRUE),
  B = rpois(70000000, lambda = 1)
)
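Once the list exists, you can work on the pieces one at a time and recombine them at the end; do.call() is just one convenient way to rbind the whole list in one go (newdf_1 and dat_recombined are illustrative names):
# Work on a single piece
newdf_1 <- dat_list[["newdf_1"]]
# Recombine everything (or rbind() pairs at a time, as described in the question)
dat_recombined <- do.call(rbind, dat_list)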
Here's a tidyverse-based solution. Try using read_csv_chunked().
library(tidyverse)
# practice data
tibble(string = sample(letters, 1e6, replace = TRUE),
       value = rnorm(1e6)) %>%
  write_csv("test.csv")
# here's the solution
partial_data <- read_csv_chunked("test.csv",
                                 DataFrameCallback$new(function(x, pos) filter(x, string == "a")),
                                 chunk_size = 1000)
You can wrap the call to read_csv_chunked in a function where you change the string that you subset on.
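A sketch of such a wrapper (read_chunk_for is a hypothetical name; the callback just filters each chunk on the value you pass in):
read_chunk_for <- function(file, target) {
  read_csv_chunked(file,
                   DataFrameCallback$new(function(x, pos) filter(x, string == target)),
                   chunk_size = 1000)
}
partial_a <- read_chunk_for("test.csv", "a")
partial_b <- read_chunk_for("test.csv", "b")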
This is more or less a repeat of this question:
How to read only lines that fulfil a condition from a csv into R?

Remove Duplicates, but Keep the Most Complete Iteration

I'm trying to figure out how to remove duplicates based on three variables (id, key, and num). I would like to remove the duplicate with the fewest columns filled. If an equal number are filled, either can be removed.
For example,
Original <- data.frame(id  = c(1, 2, 2, 3, 3, 4, 5, 5),
                       key = c(1, 2, 2, 3, 3, 4, 5, 5),
                       num = c(1, 1, 1, 1, 1, 1, 1, 1),
                       v4  = c(1, NA, 5, 5, NA, 5, NA, 7),
                       v5  = c(1, NA, 5, 5, NA, 5, NA, 7))
The output would be the following:
Finished <- data.frame(id  = c(1, 2, 3, 4, 5),
                       key = c(1, 2, 3, 4, 5),
                       num = c(1, 1, 1, 1, 1),
                       v4  = c(1, 5, 5, 5, 7),
                       v5  = c(1, 5, 5, 5, 7))
My real dataset is bigger, with a mix of mostly numerical and some character variables, and I couldn't determine the best way to go about doing this. I've previously used a program whose duplicates command had a similar option called check.all.
So far, my thoughts have been to use grepl and determine where "anything" is present
Present <- apply(Original, 2, function(x) grepl("[[:alnum:]]", x))
Then, using the resultant data frame, I take rowSums and cbind the result to the original.
CompleteNess <- rowSums(Present)
cbind(Original, CompleteNess)
This is the point where I'm unsure of my next steps... I have a variable which tells me how many columns are filled in each row (CompleteNess); however, I'm unsure how to handle the duplicates.
Simply put: when id, key, and num are duplicated, keep the row with the highest value of CompleteNess.
If anybody can think of a better way to do this or get me through the last little bit I would greatly appreciate it. Thanks All!
Here is a solution. It is not very pretty but it should work for your application:
#Order by the degree of completeness
Original<-Original[order(CompleteNess),]
#Starting from the bottom select the not duplicated rows
#based on the first 3 columns
Original[!duplicated(Original[,1:3], fromLast = TRUE),]
This does rearrange your original data frame so beware if there is additional processing later on.
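If that matters, one option is to restore the original order afterwards using the preserved row names (Deduped is just an illustrative name; this assumes the default integer row names):
Deduped <- Original[!duplicated(Original[, 1:3], fromLast = TRUE), ]
# Row names survive subsetting, so they can be used to recover the original order
Deduped <- Deduped[order(as.integer(row.names(Deduped))), ]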
You can aggregate your data and select the row with max score:
Original <- data.frame(id  = c(1, 2, 2, 3, 3, 4, 5, 5),
                       key = c(1, 2, 2, 3, 3, 4, 5, 5),
                       num = c(1, 1, 1, 1, 1, 1, 1, 1),
                       v4  = c(1, NA, 5, 5, NA, 5, NA, 7),
                       v5  = c(1, NA, 5, 5, NA, 5, NA, 7))
Present <- apply(Original, 2, function(x) grepl("[[:alnum:]]", x))
#get the score
Original$present <- rowSums(Present)
#create a column to aggregate on
Original$id.key.num <- paste(Original$id, Original$key, Original$num, sep = "-")
library("plyr")
#aggregate here
Final <- ddply(Original, .(id.key.num), summarize,
               Max = max(present))
And if you want to keep the other columns, just do this:
Final <- ddply(Original, .(id.key.num), summarize,
               Max = max(present),
               v4 = v4[which.max(present)],
               v5 = v5[which.max(present)])
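For comparison, a hedged sketch of the same idea with dplyr (>= 1.0), assuming the present and id.key.num helper columns have been added as above: group by the three key variables and keep the most complete row, breaking ties arbitrarily.
library(dplyr)
Finished <- Original %>%
  group_by(id, key, num) %>%
  slice_max(present, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(-present, -id.key.num)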

In R, I have two columns and would like to take the sum if a condition is met

I am trying to write a script in R that takes the sum of the values in one column corresponding to a condition on another column.
Say I have two columns, fakeVector and fakeVector1, in a table "total":
fakeVector = c('NTC.H3','NTC.F2','NTC.F22','abc123','sample1')
fakeVector1 = c('1','2','3','4','5')
total=rbind(fakeVector, fakeVector1)
I want to sum the fakeVector1 values where fakeVector equals a specific value, for example "NTC.H3".
How would I do that?
We can try
sum(as.numeric(total["fakeVector1",][total["fakeVector",]=="NTC.H3"]))
total[2,][which(total[1,] == "NTC.H3")]
#[1] "1"
v1 <- c('NTC.H3', 'NTC.F22', 'abc123')
sum(as.numeric(total[2,][which(total[1,] %in% v1)]))
#[1] 8
If your data set is organized as a data.frame and you want the sum of one column for every value in another column, you can use the fast data.table package.
# load library
library(data.table)
# get your data
fakeVector = c('NTC.H3','NTC.F2','NTC.F22','abc123','sample1')
fakeVector1 = c('1','2','3','4','5')
total=cbind(fakeVector, fakeVector1)
total <- as.data.table(total)
total$fakeVector1 <- as.numeric(total$fakeVector1)
# Solution
total[, .(mysum = sum(fakeVector1)), by=.(fakeVector)]
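And if you only need the sum for a single value of fakeVector (e.g. "NTC.H3"), you can filter in i and sum in j:
total[fakeVector == "NTC.H3", sum(fakeVector1)]
# [1] 1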

Create a new column based on values from other variables

I have data that looks like this:
A set of 10 character variables
Char<-c("A","B","C","D","E","F","G","H","I","J")
And a data frame that looks like this
Col1 <- 1:25
Col2 <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4,5,5,5,5,5)
DF <- data.frame(Col1, Col2)
What I would like to do is to add a third column to the data frame, with the logic that 1=A, 2=B, 3= C and so on. So the end result would be
Col3<-c("A","A","A","A","A","B","B","B","B","B","C","C","C","C","C","D","D","D","D","D","E","E","E","E","E")
DF <- data.frame(Col1, Col2, Col3)
For this simple example I could go with a simple substitution like this question:
Create new column based on 4 values in another column
But my actual data set is much bigger with a lot more variables than this simple example, so writing out the equivalents as in the above answer is not a possibility.
So I would like to have a bit of code that can be applied to a much larger data frame. Perhaps something that looped through all the values of Col2 and matched them to the location of Char.
1=Char[1] 2=Char[2] 3=Char[3]...... for the entire length of Col2
Or any other way that could scale up to a long monstrous data frame
# Values that Col2 might have taken
levels = c(1, 2, 3, 4, 5)
# Labels for the levels in same order as levels
labels = c('A', 'B', 'C', 'D', 'E')
DF$Col3 <- factor(DF$Col2, levels = levels, labels = labels)
I know it may be taboo to use for loops in R, but I tried this out and it worked well.
for (i in seq_along(DF$Col2)) {
  DF$Col3[i] <- Char[DF$Col2[i]]
}
Would that be sufficient? I think you could also use unique(DF$Col2) or levels(factor(DF$Col2)).
Perhaps though I'm misunderstanding your question.
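Since Col2 already holds valid positions into Char, the same lookup can also be written as a single vectorized subscript (same result, no loop):
DF$Col3 <- Char[DF$Col2]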
If you wanted to use each column as an index into some vector (I'll use letters so I can index up to 25), returning a data frame of the same dimension of DF, you could use:
transformed <- as.data.frame(lapply(DF, function(x) letters[x]))
head(transformed)
# Col1 Col2
# 1 a a
# 2 b a
# 3 c a
# 4 d a
# 5 e a
# 6 f b
You could then combine this with your original data frame with cbind(DF, transformed).
Why not make a key and join?
library(dplyr)
letter_key = data_frame(letter__ID = 1:26,
                        letter = letters)
DF %>%
  rename(letter__ID = Col2) %>%
  left_join(letter_key)
This kind of thing can also be done with factors.

Computing subset of column means in data frame (R programming)

I have a simple data frame:
a=data.frame(first=c(1,2,3),second=c(3,4,5),third=c('x','y','z'))
I'm trying to return a data frame that contains the column means for just the first and second columns. I've been doing it like this:
apply(a[,c('first','second')],2,mean)
Which returns the appropriate output:
first second
2 4
However, I want to know if I can do it using the function by. I tried this:
by(a, c("first", "second"), mean)
Which resulted in:
Error in tapply(seq_len(3L), list(`c("first", "second")` = c("first", :
arguments must have same length
Then, I tried this:
by(a, c(T, T,F), mean)
Which also did not yield the correct answer:
c(T,T,F): FALSE
[1] NA
Any suggestions? Thanks!
You can use colMeans (column means) on a subset of the original data
> a <- data.frame(first = c(1,2,3), second = c(3,4,5), third = c('x','y','z'))
If you know the column number, but not the column name,
> colMeans(a[, 1:2])
## first second
## 2 4
Or, if you don't know the column numbers but know the column name,
> colMeans(a[, c("first", "second")])
## first second
## 2 4
Finally, if you know nothing about the columns and want the means for the numeric columns only,
> colMeans(a[, sapply(a, is.numeric)])
## first second
## 2 4
by() is not the right tool, because it is a wrapper for tapply(), which partitions your data frame into subsets that meet some criteria. If you had another column, say fourth, you could split your data frame using by() for that column and then operate on rows or columns using apply().
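A small sketch of that usage, with a hypothetical grouping column fourth added to the example data:
a$fourth <- c("g1", "g1", "g2")   # hypothetical grouping column
# Column means of the numeric columns, computed within each level of `fourth`
by(a[, c("first", "second")], a$fourth, colMeans)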
