Convert factors to numbers in a categorical data frame quickly in R

I have a big categorical data frame like
col1 col2 col3
abcd rweff 3433534
gfds erwq trdfs
abcd erwq trdfs
abcd rweff 3433534
......
I want to replace all these complicated categories with simple numbers, something like this:
col1 col2 col3
1 2 1
2 1 2
1 1 2
1 2 1
......
How do I quickly achieve it in R?

Assuming that the columns are of 'factor' class
df1[] <- lapply(df1, as.numeric)
df1
# col1 col2 col3
#1 1 2 1
#2 2 1 2
#3 1 1 2
#4 1 2 1
If the columns are 'character' class, then convert to 'factor' and use as.numeric
df1[] <- lapply(df1, function(x) as.numeric(factor(x)))
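Note that the numbers are just positions in levels(), so it can help to keep the levels as a lookup table before overwriting anything. A minimal sketch, assuming factor columns built from the example data in the question; the names lookup and codes are only for illustration:

```r
# Example data from the question, stored as factors
df1 <- data.frame(
  col1 = c("abcd", "gfds", "abcd", "abcd"),
  col2 = c("rweff", "erwq", "erwq", "rweff"),
  col3 = c("3433534", "trdfs", "trdfs", "3433534"),
  stringsAsFactors = TRUE
)

# Keep the original labels so the codes can be translated back later
lookup <- lapply(df1, levels)

# Replace every factor column by its integer level codes
codes <- df1
codes[] <- lapply(df1, as.integer)

codes
lookup$col2[codes$col2]  # recovers the original col2 labels
```

The codes follow the sorted order of the levels, not the order of first appearance, which is why "erwq" becomes 1 and "rweff" becomes 2.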
Similar options exist in dplyr and data.table. They may be faster (I haven't benchmarked them).
library(dplyr)
# mutate_each() and funs() are deprecated; across() (dplyr >= 1.0.0) is the current equivalent
df1 <- mutate(df1, across(everything(), as.numeric))
With the compound assignment pipe %<>% from magrittr, the result is written back to df1 without an explicit assignment.
library(magrittr)
df1 %<>%
  mutate(across(everything(), as.numeric))
Or
library(data.table)
setDT(df1)[, lapply(.SD, as.numeric)]
Or, slightly more efficiently, use set(), which modifies the columns by reference and avoids the overhead of [.data.table:
setDT(df1)
for (j in seq_len(ncol(df1))) {
  set(df1, i = NULL, j = j, value = as.numeric(df1[[j]]))
}
data
df1 <- structure(list(col1 = structure(c(1L, 2L, 1L, 1L),
.Label = c("abcd",
"gfds"), class = "factor"), col2 = structure(c(2L, 1L, 1L, 2L
), .Label = c("erwq", "rweff"), class = "factor"),
col3 = structure(c(1L,
2L, 2L, 1L), .Label = c("3433534", "trdfs"), class = "factor")),
.Names = c("col1",
"col2", "col3"), row.names = c(NA, -4L), class = "data.frame")

Related

merge list of data frames to single data frame by all rows [duplicate]

I have a list of data frames which looks like this:
df1
col1 col2
house. 10
cat. 5
dog 7
mouse 4
df2
col1 col2
house. 6
apple. 4
dog 8
elephant 3
df3
col1 col2
horse 1
banana 1
dog 8
The desired output would be:
         df1 df2 df3
house.    10   6  NA
cat.       5  NA  NA
dog        7   8   8
mouse      4  NA  NA
apple.    NA   4  NA
elephant  NA   3  NA
horse     NA  NA   1
banana    NA  NA   1
Any suggestion?
I tried to do the following:
list_df<-list(df1,df2,df3)
df_all<-do.call("rbind", list_df)
df_merge<-as.data.frame(unique(df_all$col1))
colnames(df_merge)<-"category"
df_merge$df1 <- with(df_merge, ifelse (category %in% df1$col1,df1$col2,NA))
However, when I add the second data frame, I get this error:
$ operator is invalid for atomic vectors
Using dplyr:
library(dplyr)
df <- full_join(df1, df2, by = "col1")
df <- full_join(df, df3, by = "col1")
# column_to_rownames() comes from tibble, not dplyr
df %>%
  tibble::column_to_rownames(var = "col1")
# col2.x col2.y col2
#house. 10 6 NA
#cat. 5 NA NA
#dog 7 8 8
#mouse 4 NA NA
#apple. NA 4 NA
#elephant NA 3 NA
#horse NA NA 1
#banana NA NA 1
Update: if you have many data frames, you can use reduce() from purrr:
library(tidyverse)
list(df1, df2, df3) %>% reduce(full_join, by = "col1") ## this would help
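If the value columns should end up named after the data frames (df1, df2, df3), as in the desired output, one option is to rename them before joining. Here is a minimal base R sketch of that idea, using merge(..., all = TRUE) in place of full_join and character keys instead of factors; the names dfs, renamed and merged are only for illustration:

```r
# Example inputs mirroring the question
df1 <- data.frame(col1 = c("house.", "cat.", "dog", "mouse"),
                  col2 = c(10L, 5L, 7L, 4L), stringsAsFactors = FALSE)
df2 <- data.frame(col1 = c("house.", "apple.", "dog", "elephant"),
                  col2 = c(6L, 4L, 8L, 3L), stringsAsFactors = FALSE)
df3 <- data.frame(col1 = c("horse", "banana", "dog"),
                  col2 = c(1L, 1L, 8L), stringsAsFactors = FALSE)

# A named list, so each value column can take the name of its data frame
dfs <- list(df1 = df1, df2 = df2, df3 = df3)
renamed <- Map(function(d, nm) setNames(d, c("col1", nm)), dfs, names(dfs))

# Fold the list with a full outer join on col1
merged <- Reduce(function(x, y) merge(x, y, by = "col1", all = TRUE), renamed)
merged
```

merge() sorts the rows by the key, so the row order differs from the full_join result; the values are the same.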
data
df1 <- structure(list(col1 = structure(c(3L, 1L, 2L, 4L), .Label = c("cat.", "dog", "house.", "mouse"), class = "factor"), col2 = c(10L, 5L, 7L, 4L)), class = "data.frame", row.names = c(NA, -4L))
df2 <- structure(list(col1 = structure(c(4L, 1L, 2L, 3L), .Label = c("apple.", "dog", "elephant", "house."), class = "factor"), col2 = c(6L, 4L, 8L, 3L)), class = "data.frame", row.names = c(NA, -4L))
df3 <- structure(list(col1 = structure(c(3L, 1L, 2L), .Label = c("banana", "dog", "horse"), class = "factor"), col2 = c(1L, 1L, 8L)), class = "data.frame", row.names = c(NA, -3L))

Mapping values across a dataframe

I have a large dataset. The example below is a much abbreviated version.
There are two dataframes, df1 and df2. I would like to map to each row of df1, a derived value using conditions from df2 with arguments from df1.
Hope the example below makes more sense
year <- rep(1996:1997, each=3)
age_group <- rep(c("20-24","25-29","30-34"),2)
df1 <- as.data.frame(cbind(year,age_group))
df1 is a database with all permutations of year and age group.
df2 <- as.data.frame(rbind(c(111,1997,"20-24"),c(222,1997,"30-34")))
names(df2) <- c("id","year","age.group")
df2 is a database where each row represents an individual at a particular year
I would like to use arguments from df1 conditional on values from df2 and then to map to df1. The arguments are as follows:
each_yr <- map(df1, function(year,age_group) case_when(
as.character(df1$year) == as.character(df2$year) & as.character(df1$age_group)
== as.character(df2$age.group)~ 0,
TRUE ~ 1))
The output I get is wrong and is shown below:
structure(list(year = c(1, 1, 1, 1, 1, 0), age_group = c(1, 1,
1, 1, 1, 0)), .Names = c("year", "age_group"))
The output I would ideally like is something like this (a data frame as an example, but I would be happy with a list):
structure(list(year = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("1996",
"1997"), class = "factor"), age_group = structure(c(1L, 2L, 3L,
1L, 2L, 3L), .Label = c("20-24", "25-29", "30-34"), class = "factor"),
v1 = structure(c(2L, 2L, 2L, 1L, 2L, 2L), .Label = c("0",
"1"), class = "factor"), v2 = structure(c(2L, 2L, 2L, 2L,
2L, 1L), .Label = c("0", "1"), class = "factor")), .Names = c("year",
"age_group", "v1", "v2"), row.names = c(NA, -6L), class = "data.frame")
I have used map before when 'df1' is a vector, but in this scenario it is a data frame where both columns are used as arguments. Can map handle this?
In the desired output above (df3), column v1 is the result of the conditions based on df1 and df2 for patient '111', mapped onto the rows of df1. Likewise, column v2 is the outcome for patient '222'.
Thanks in advance
This looks like a job for pmap instead, with a touch of tidyr to get the suggested result.
library(magrittr)  # provides %>%
purrr::pmap(list(df2$id, as.character(df2$year), as.character(df2$age.group)),
            function(id, x, y)
              data.frame(df1,
                         key = paste0("v", id),
                         value = 1 - as.integer((x == df1$year) & (y == df1$age_group)),
                         stringsAsFactors = FALSE
              )) %>%
  replyr::replyr_bind_rows() %>% tidyr::spread(key, value)
# year age_group v1 v2
#1 1996 20-24 1 1
#2 1996 25-29 1 1
#3 1996 30-34 1 1
#4 1997 20-24 0 1
#5 1997 25-29 1 1
#6 1997 30-34 1 0
Within the tidyverse you can do it this way:
library(tidyverse)
#library(dplyr)
#library(tidyr)
df2 %>%
mutate(tmp = 0) %>%
spread(id, tmp, fill = 1, sep = "_") %>%
right_join(df1, by = c("year", "age.group" = "age_group")) %>%
mutate_at(vars(-c(1, 2)), coalesce, 1)
# year age.group id_111 id_222
# 1 1996 20-24 1 1
# 2 1996 25-29 1 1
# 3 1996 30-34 1 1
# 4 1997 20-24 0 1
# 5 1997 25-29 1 1
# 6 1997 30-34 1 0
#Warning messages:
# 1: Column `year` joining factors with different levels, coercing to character vector
# 2: Column `age.group`/`age_group` joining factors with different levels, coercing to
# character vector
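For comparison, the same indicator columns can be built without any packages. This is only a sketch, assuming character columns (which sidesteps the factor-level warnings above); the names flags and df3 are mine, and the new columns are named after the ids (v111, v222):

```r
# Inputs as in the question, stored as character columns
df1 <- data.frame(year = rep(c("1996", "1997"), each = 3),
                  age_group = rep(c("20-24", "25-29", "30-34"), 2),
                  stringsAsFactors = FALSE)
df2 <- data.frame(id = c("111", "222"),
                  year = c("1997", "1997"),
                  age.group = c("20-24", "30-34"),
                  stringsAsFactors = FALSE)

# One column per individual in df2:
# 0 where the df1 row matches that individual's year and age group, 1 otherwise
flags <- sapply(seq_len(nrow(df2)), function(k) {
  as.integer(!(df1$year == df2$year[k] & df1$age_group == df2$age.group[k]))
})
colnames(flags) <- paste0("v", df2$id)

df3 <- cbind(df1, flags)
df3
```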

how to count and remove similar strings across columns

I have data with many columns. For example, here it is with three columns:
df<-structure(list(V1 = structure(c(5L, 1L, 7L, 3L, 2L, 4L, 6L, 6L
), .Label = c("CPSIAAAIAAVNALHGR", "DLNYCFSGMSDHR", "FPEHELIVDPQR",
"IADPDAVKPDDWDEDAPSK", "LWADHGVQACFGR", "WGEAGAEYVVESTGVFTTMEK",
"YYVTIIDAPGHR"), class = "factor"), V2 = structure(c(5L, 2L,
7L, 3L, 4L, 6L, 1L, 1L), .Label = c("", "CPSIAAAIAAVNALHGR",
"GCITIIGGGDTATCCAK", "HVGPGVLSMANAGPNTNGSQFFICTIK", "LLELGPKPEVAQQTR",
"MVCCSAWSEDHPICNLFTCGFDR", "YYVTIIDAPGHR"), class = "factor"),
V3 = structure(c(4L, 3L, 2L, 4L, 3L, 1L, 1L, 1L), .Label = c("",
"AVCMLSNTTAIAEAWAR", "DLNYCFSGMSDHR", "FPEHELIVDPQR"), class = "factor")), .Names = c("V1",
"V2", "V3"), class = "data.frame", row.names = c(NA, -8L))
- First column: we don't look at any other column; we just count how many strings there are and keep the unique ones.
- Second column: we keep the unique strings and remove those that were already in the first column.
- Third column: we keep the unique strings and remove those that were in the first or second column.
This continues for as many columns as we have.
For example, for this data we will have the following:
Column 1                Column 2                      Column 3
LWADHGVQACFGR
CPSIAAAIAAVNALHGR       LLELGPKPEVAQQTR               AVCMLSNTTAIAEAWAR
YYVTIIDAPGHR            GCITIIGGGDTATCCAK
FPEHELIVDPQR            HVGPGVLSMANAGPNTNGSQFFICTIK
DLNYCFSGMSDHR           MVCCSAWSEDHPICNLFTCGFDR
IADPDAVKPDDWDEDAPSK
WGEAGAEYVVESTGVFTTMEK
Here is a solution via tidyverse,
library(tidyverse)
df1 <- df %>%
gather(var, string) %>%
filter(string != '' & !duplicated(string)) %>%
group_by(var) %>%
mutate(cnt = seq(n())) %>%
spread(var, string) %>%
select(-cnt)
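Which gives (the output below was printed before the final select(-cnt), so the helper column cnt is still visible):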
# A tibble: 7 x 4
cnt V1 V2 V3
* <int> <chr> <chr> <chr>
1 1 LWADHGVQACFGR LLELGPKPEVAQQTR AVCMLSNTTAIAEAWAR
2 2 CPSIAAAIAAVNALHGR GCITIIGGGDTATCCAK <NA>
3 3 YYVTIIDAPGHR HVGPGVLSMANAGPNTNGSQFFICTIK <NA>
4 4 FPEHELIVDPQR MVCCSAWSEDHPICNLFTCGFDR <NA>
5 5 DLNYCFSGMSDHR <NA> <NA>
6 6 IADPDAVKPDDWDEDAPSK <NA> <NA>
7 7 WGEAGAEYVVESTGVFTTMEK <NA> <NA>
You can use colSums to get the number of strings,
colSums(!is.na(df1))
#V1 V2 V3
# 7 4 1
A similar approach in base R, which stores the strings in a list, would be:
df[] <- lapply(df, as.character)
d1 <- stack(df)
d1 <- d1[d1$values != '' & !duplicated(d1$values),]
l1 <- unstack(d1, values ~ ind)
lengths(l1)
#V1 V2 V3
# 7 4 1
A base R solution. df2 is the final output.
# Convert to character
L1 <- lapply(df, as.character)
# Get unique strings
L2 <- lapply(L1, unique)
# Remove ""
L3 <- lapply(L2, function(vec) vec[!(vec %in% "")])
# Loop over columns 2..n and drop strings already seen in earlier columns
for (i in seq_along(L3)[-1]) {
  previous_vec <- unlist(L3[1:(i - 1)])
  current_vec <- L3[[i]]
  L3[[i]] <- current_vec[!(current_vec %in% previous_vec)]
}
# Get the maximum column length
max_num <- max(lengths(L3))
# Pad each column with "" up to the maximum length
L4 <- lapply(L3, function(vec) c(vec, rep("", max_num - length(vec))))
# Convert L4 to a data frame
df2 <- as.data.frame(do.call(cbind, L4))
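To see the "keep unique, drop anything seen in earlier columns" logic on something small, here is a toy sketch with made-up strings (cols, seen and kept are illustrative names, not from the question). It keeps a running vector of everything already used:

```r
# Toy columns; "" stands for an empty cell
cols <- list(V1 = c("a", "b", "a"),
             V2 = c("b", "c", ""),
             V3 = c("c", "d", "a"))

seen <- character(0)  # strings already claimed by earlier columns
kept <- list()
for (nm in names(cols)) {
  v <- unique(cols[[nm]])
  v <- v[v != "" & !(v %in% seen)]  # drop blanks and earlier strings
  kept[[nm]] <- v
  seen <- c(seen, v)
}

kept
lengths(kept)  # number of strings retained per column
```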

Pick a column to multiply with, contingent on value of other variables

I am still taking my first steps with R and have found SO to be a great tool for learning more and finding answers to my questions. For this one, though, I did not manage to find a good solution here.
I have a dataframe that can be simplified to this structure:
set.seed(10)
df <- data.frame(v1 = rep(1:2, times=3),
v2 = c("A","B","B","A","B","A"),
v3 = sample(1:6),
xA_1 = sample(1:6),
xA_2 = sample(1:6),
xB_1 = sample(1:6), xB_2 = sample(1:6))
df thus looks like this:
> df
v1 v2 v3 xA_1 xA_2 xB_1 xB_2
1 1 A 4 2 1 3 3
2 2 B 2 6 3 5 4
3 1 B 5 3 2 4 5
4 2 A 3 5 4 2 1
5 1 B 1 4 6 6 2
6 2 A 6 1 5 1 6
I now want R to create a fourth variable, which is dependent on the values of v1 and v2. I achieve this by using the following code:
library(data.table)
df <- data.table(df)
df[, v4 := ifelse(v1 == 1 & v2 == "A", v3*xA_1,
ifelse(v1 == 1 & v2 == "B", v3*xB_1,
ifelse(v1 == 2 & v2 == "A", v3*xA_2,
ifelse(v1 == 2 & v2 == "B", v3*xB_2, v3*1))))]
So v4 is created by multiplying v3 with the column that contains the v1 and the v2 value
(e.g. for row 1: v1=1 and v2=A thus multiply v3=4 with xA_1=2 -> 8).
> df$v4
[1] 8 8 20 12 6 30
Obviously, my ifelse approach is tedious when v1 and v2 in fact have many more distinct values than in this example. So I am looking for an efficient way to tell R: if v1 == y & v2 == z, multiply v3 by column xz_y.
I tried writing a for-loop, writing a function that has y and z as index and using the apply function. However none of this worked as wanted.
I appreciate any ideas!
Here's a base R option:
i <- paste0("x", df$v2, "_", df$v1)
df$v4 <- df$v3 * as.numeric(df[cbind(1:nrow(df), match(i, names(df)))])
For the sample data provided below, it creates a column v4 as:
> df$v4
[1] 25 12 2 6 3 10
Or if you want to include the "else" condition to multiply by 1 in case there's no matching column name:
i <- paste0("x", df$v2, "_", df$v1)
tmp <- as.numeric(df[cbind(1:nrow(df), match(i, names(df)))])
df$v4 <- df$v3 * ifelse(is.na(tmp), 1, tmp)
Sample data:
df <- structure(list(v1 = c(1L, 2L, 1L, 2L, 1L, 2L), v2 = structure(c(1L,
2L, 2L, 1L, 2L, 1L), .Label = c("A", "B"), class = "factor"),
v3 = c(5L, 4L, 1L, 6L, 3L, 2L), xA_1 = c(5L, 6L, 3L, 1L,
2L, 4L), xA_2 = c(6L, 4L, 2L, 1L, 3L, 5L), xB_1 = c(4L, 6L,
2L, 5L, 1L, 3L), xB_2 = c(5L, 3L, 2L, 4L, 1L, 6L)), .Names = c("v1",
"v2", "v3", "xA_1", "xA_2", "xB_1", "xB_2"), row.names = c(NA,
-6L), class = "data.frame")
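A variant of the same matrix-indexing idea, restricted to the x* columns so that the lookup stays numeric and no as.numeric() round trip is needed. This is a sketch under the assumption that all multiplier columns start with "x"; x_cols, x_mat, target and multiplier are illustrative names:

```r
# Same values as the sample data above
df <- data.frame(v1 = c(1L, 2L, 1L, 2L, 1L, 2L),
                 v2 = c("A", "B", "B", "A", "B", "A"),
                 v3 = c(5L, 4L, 1L, 6L, 3L, 2L),
                 xA_1 = c(5L, 6L, 3L, 1L, 2L, 4L),
                 xA_2 = c(6L, 4L, 2L, 1L, 3L, 5L),
                 xB_1 = c(4L, 6L, 2L, 5L, 1L, 3L),
                 xB_2 = c(5L, 3L, 2L, 4L, 1L, 6L),
                 stringsAsFactors = FALSE)

x_cols <- grep("^x", names(df), value = TRUE)  # candidate multiplier columns
x_mat  <- as.matrix(df[x_cols])                # numeric matrix, one row per df row

# Name of the column each row should use, e.g. "xA_1" for v2 = "A", v1 = 1
target <- paste0("x", df$v2, "_", df$v1)

# Pick one cell per row via (row, column) index pairs
multiplier <- x_mat[cbind(seq_len(nrow(df)), match(target, x_cols))]
multiplier[is.na(multiplier)] <- 1             # the "else" case: no matching column

df$v4 <- df$v3 * multiplier
df$v4
# [1] 25 12  2  6  3 10
```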
This is a standard "wide" table problem - what you want is harder to do as-is, but easy when the data is "melted":
library(data.table)
dt <- as.data.table(df)
# The join back to dt assumes (v1, v2, v3) identifies each row uniquely
melt(dt, id.vars = c('v1', 'v2', 'v3'))[variable == paste0('x', v2, '_', v1)
  ][dt, on = c('v1', 'v2', 'v3'), v3 * value]
#[1] 8 8 20 12 6 30
You can try this:
v4 <- numeric(nrow(df))  # preallocate instead of growing the vector
for (i in seq_len(nrow(df))) {
  col <- paste0("x", df$v2[i], "_", df$v1[i])
  v4[i] <- df$v3[i] * df[i, col]
}
df$v4 <- v4

delete the rows with duplicated ids

I want to delete the rows with duplicated ids
data
id V1 V2
1 a 1
1 b 2
2 a 2
2 c 3
3 a 4
The problem is that some people took the test several times, which generates multiple scores in V2. I want to delete the duplicated ids and keep one of the V2 scores at random.
output
id V1 V2
1 a 1
2 a 2
3 a 4
I tried this:
neu <- unique(neu$userid)
but it didn't work
Using dplyr:
library(dplyr)
set.seed(1)
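# .keep_all = TRUE keeps V1 and V2; without it, distinct(id) returns only the id column
df %>% sample_frac(1) %>% arrange(id) %>% distinct(id, .keep_all = TRUE)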
Output:
id V1 V2
1 1 b 2
2 2 c 3
3 3 a 4
Data:
df <- structure(list(id = c(1L, 1L, 2L, 2L, 3L), V1 = structure(c(1L,
2L, 1L, 3L, 1L), .Label = c("a", "b", "c"), class = "factor"),
V2 = c(1L, 2L, 2L, 3L, 4L)), .Names = c("id", "V1", "V2"), class = "data.frame", row.names = c(NA,
-5L))
Creating the data frame based on your example:
df <- read.table(text =
"id V1 V2
1 a 1
1 b 2
2 a 2
2 c 3
3 a 4", header = TRUE)
Since you want to remove rows randomly, first sort the rows of your data frame randomly:
df <- df[sample(nrow(df)),]
Then remove duplicates in order of appearance:
df <- df[!duplicated(df$id),]
Now sort your data frame back:
df <- df[with(df, order(id)),]
Remember to replace df with the name of your data frame.
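The three steps can also be wrapped in a small helper. This is just a sketch; pick_one_per_id is a made-up name, and the seed is only there to make the example reproducible:

```r
# Shuffle the rows, keep the first row seen for each id, then sort by id again
pick_one_per_id <- function(data, id_col) {
  shuffled <- data[sample(nrow(data)), , drop = FALSE]
  kept <- shuffled[!duplicated(shuffled[[id_col]]), , drop = FALSE]
  kept[order(kept[[id_col]]), , drop = FALSE]
}

df <- data.frame(id = c(1L, 1L, 2L, 2L, 3L),
                 V1 = c("a", "b", "a", "c", "a"),
                 V2 = c(1L, 2L, 2L, 3L, 4L),
                 stringsAsFactors = FALSE)

set.seed(42)
res <- pick_one_per_id(df, "id")
res
```

Which row survives for ids 1 and 2 depends on the random shuffle, but the result always has exactly one row per id, taken unchanged from the input.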
