I found a similar question asked before, but my question is a bit more complex than the previous one: in my case the Y argument is not fixed.
In function(X, Y) {SOME FUNCTION}, X is a list of characters and Y is a list of data frames. Basically, I want the function to work on the pairs of X and Y in sequence and produce the output as one list: the first element of X with the first element of Y, the second element of X with the second element of Y, the third element of X with the third element of Y, and so on.
Example of X, Y
X <- c("1", "2")
y1 <- data.frame("person.1" = "Amy", "bestfood..1" = "fish", "bestthing..1" = "book",
"person.2" = "Mike", "bestfood..2" = "fish", "bestthing..2" = "book")
y2 <- data.frame("person.1" = "Amy","bestfood..1" = "carrot", "bestthing..1" = "cloth",
"person.2" = "Mike","bestfood..2" = "carrot", "bestthing..2" = "cloth")
Y <- list(y1,y2)
The function:
addID <- function(X, Y) {
  rowlength <- length(Y)
  df <- as.data.frame(matrix(NA, nrow = rowlength, ncol = 3))
  colnames(df) <- c("ID", "Person", "Food")
  df[1:nrow(df), 1] <- X
  # name
  namecols <- grep("person", colnames(Y))
  for (i in 1:length(namecols)) {
    name <- Y[1, namecols[i]]
    df[i, 2] <- as.character(name)
  }
  # food
  foodcols <- grep("bestfood", colnames(Y))
  for (i in 1:length(foodcols)) {
    food <- Y[1, foodcols[i]]
    df[i, 3] <- as.character(food)
  }
  return(df)
}
I tried to use lapply but can't figure out how to include the X list. When I try this:
lapply(Y, addID, X = X)
The function doesn't work properly. I wonder if there are other ways to include X in it (I tried the function on an individual character and data frame, and it works just fine).
I hope this is clear. If not, please point it out, I will try my best to clarify. Thanks in advance.
UPDATE:
I tried Map as suggested in the comments. It returns "incorrect number of dimensions". I added some details in the function. It seems like R gets stuck on the last line.
outcome <- Map(addID, Y, X)
I get
error in Y[1, namecols[i]] : incorrect number of dimensions
In addition: Warning message:
In `[<-.data.frame`(`*tmp*`, 1:nrow(df), 1, value = list(person.1 = 1L, :
provided 6 variables to replace 1 variables
The outcome should look like:
z1 <- data.frame(ID = c(1,2), Person = c("Amy","Mike"), Food = c("fish", "fish"))
z2 <- data.frame(ID = c(1,2), Person = c("Amy","Mike"), Food = c("carrot", "carrot"))
outcome <- list(z1,z2)
We could do this easily in tidyverse
library(dplyr)
library(tidyr)
bind_rows(Y, .id = 'ID') %>%
select(ID, starts_with('person'), contains('food')) %>%
pivot_longer(cols = -ID, names_to = c(".value"),
names_pattern = "([^.]+)\\.+\\d+")
-output
# A tibble: 4 x 3
ID person bestfood
<chr> <chr> <chr>
1 1 Amy fish
2 1 Mike fish
3 2 Amy carrot
4 2 Mike carrot
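If the result is needed as a list of two data frames (like the OP's expected 'outcome') rather than a single tibble, one option is to split the reshaped data by 'ID' at the end; a small sketch building on the pipe above:
out <- bind_rows(Y, .id = 'ID') %>%
    select(ID, starts_with('person'), contains('food')) %>%
    pivot_longer(cols = -ID, names_to = c(".value"),
        names_pattern = "([^.]+)\\.+\\d+")
# one tibble per ID, returned as a list
group_split(out, ID)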
With the OP's function, if we modify it, it would work
addID <- function(X, Y) {
rowlength <- length(Y)
df <- as.data.frame(matrix(NA, nrow = rowlength, ncol = 3))
colnames(df) <- c("ID", "Person", "Food")
df[1:nrow(df), 1] <- X
namecols <- grep("person",colnames(Y))
df[, 2] <- unlist(Y[namecols])
foodcols <- grep("bestfood", colnames(Y))
df[,3] <- unlist(Y[foodcols])
return(unique(df))
}
-testing
Map(addID, X, Y)
$`1`
ID Person Food
1 1 Amy fish
2 1 Mike fish
$`2`
ID Person Food
1 2 Amy carrot
2 2 Mike carrot
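For reference, lapply(Y, addID, X = X) fails because it passes the whole X vector to every call instead of pairing the elements, while Map() (or mapply() with SIMPLIFY = FALSE) iterates over both lists in parallel. A sketch of the equivalent mapply call:
# same result as Map(addID, X, Y): pairs X[[1]] with Y[[1]], X[[2]] with Y[[2]], ...
mapply(addID, X, Y, SIMPLIFY = FALSE)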
Related
I have a list of ~8000 vectors, and I would like to know how many duplicates there are of these 8000 vectors, but the order of the elements in each could be different.
for example:
list <- c()
list[[1]] <- c(1,2,3)
list[[2]] <- c(2,1,3)
list[[3]] <- c(3,2,1)
list[[4]] <- c(4,5)
list[[5]] <- c(5,4)
list[[6]] <- c(1,2,3,5)
should give me a count of 3 for c(1,2,3) and 2 for c(4,5) and 1 for c(1,2,3,5)
I'd like the count of each of the duplicates, not just how many are duplicated.
library(tidyverse)
library(gtools)
get_perm <- function(v) {
m <- permutations(n = length(v), r = length(v), v = v, set = F)
m[order(c(m))]
}
all <- map(list, get_perm)
unique <- map(list, get_perm) %>% unique()
res_vec <- c()
element <- c()
for(i in seq_along(unique)) {
element[[i]] <- unique[[i]] %>% unique() %>% paste(collapse = ",")
res_vec[[i]] <- all %in% unique[i] %>% sum()
}
tibble(
elements = unlist(element),
numbers = res_vec
)
Result
# A tibble: 3 x 2
elements numbers
<chr> <int>
1 1,2,3 3
2 4,5 2
3 1,2,3,5 1
elements contains all the individual elements of the vectors for each group and numbers are the numbers of vectors you have in each group.
We create a function that takes a vector as an argument ('val'), then loop through the list with sapply, check whether each list element has the same set of values as 'val' (setequal, so the order doesn't matter), and sum the logical vector:
f1 <- function(lst, val) sum(sapply(lst, function(x) setequal(x, val)))
f1(list, c(1, 2, 3))
#[1] 3
f1(list, c(4, 5))
#[1] 2
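To get the whole table of counts in one go, a small sketch (assuming the same 'list' and 'f1' from above) that applies f1 to every distinct set of values:
vals <- unique(lapply(list, sort))
data.frame(elements = sapply(vals, paste, collapse = ","),
           numbers = sapply(vals, function(v) f1(list, v)))
#  elements numbers
#1    1,2,3       3
#2      4,5       2
#3  1,2,3,5       1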
The goal is to rename the columns in a list of data frames while adding each data frame's name to the new column names.
ex: from x to a_x and b_x.
Why? Because I plan to later merge the sets and would like clean ids for the columns.
a = data.frame(x = c(1,2))
b = data.frame(x = c(3,4))
frameList = list(a = a, b = b)
newName = c(*frameName+'_'+'x')
names = lapply(names, setNames, nm = newName)
list2env(names,.GlobalEnv)
Here is one way for you. I looped through each data frame in frameList using the length of frameList. For the column names in each data frame, I took the name of that data frame (i.e., from names(frameList)) and pasted it onto the column names of the data frame.
a = data.frame(x = c(1,2), y = 1:2)
b = data.frame(x = c(3,4), y = 1:2)
frameList = list(a = a, b = b)
lapply(1:length(names(frameList)), function(x) {
names(frameList[[x]]) <- paste(names(frameList)[x], names(frameList[[x]]), sep = "_")
return(frameList[[x]])
})
[[1]]
a_x a_y
1 1 1
2 2 2
[[2]]
b_x b_y
1 3 1
2 4 2
Or another option is Map
Map(function(x, y) setNames(x, paste(y, names(x), sep="_")), frameList, names(frameList))
#$a
# a_x a_y
#1 1 1
#2 2 2
#$b
# b_x b_y
#1 3 1
#2 4 2
Or with tidyverse
library(purrr)
library(dplyr)
f1 <- function(x, y) paste(y, x, sep="_")
map2(frameList, names(frameList), ~ .x %>%
rename_all(f1, .y))
If we need it in the reverse order (i.e. x_a instead of a_x), this is simpler
map2(frameList, names(frameList), ~ .x %>%
rename_all(paste, sep="_", .y))
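For completeness, the same renaming can also be written with purrr::imap, which passes each list element together with its name; a small sketch using the already loaded purrr:
imap(frameList, ~ setNames(.x, paste(.y, names(.x), sep = "_")))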
This is a follow-up question to my previous question. I ran into a problem finding a memory-efficient solution to find a "common third" for my large data set (3.5 million groups and 6.2 million persons).
The proposed solution using the igraph package works fast for normal-sized data sets but unfortunately runs into memory issues for bigger data sets because it creates a large matrix. A similar issue comes up with my own solution using chained inner joins, where the third inner join inflates the data frame so that my PC runs out of memory (16 GB).
df.output <- inner_join(df,df, by='group' ) %>%
inner_join(.,df, by=c('person.y'='person')) %>%
inner_join(.,df, by=c('group.y'='group')) %>%
rename(person_in_common='person.y', pers1='person.x',pers2='person') %>%
select(pers1,pers2,person_in_common) %>%
filter(pers1!=pers2) %>%
distinct() %>%
filter(person_in_common!=pers1 & person_in_common!=pers2)
df.output[-3] <- t(apply(df.output[-3], 1,
FUN=function(x) sort(x, decreasing=FALSE)))
df.output <- unique(df.output)
Small data set example and expected output
df <- data.frame(group= c("a","a","b","b","b","c"),
person = c("Tom","Jerry","Tom","Anna","Sam","Nic"), stringsAsFactors = FALSE)
df
group person
1 a Tom
2 a Jerry
3 b Tom
4 b Anna
5 b Sam
6 c Nic
and expected result
df.output
pers1 pers2 person_in_common
1 Anna Jerry Tom
2 Jerry Sam Tom
3 Sam Tom Anna
4 Anna Tom Sam
6 Anna Sam Tom
Unfortunately I don't have access to a machine with more RAM and am also not really experienced with cloud computing, so I hope to make it work on my local PC. I would appreciate input on how to optimize any of the solutions, or advice on how to tackle the problem otherwise.
Edit 1
A dataframe which reflects my actual data size.
set.seed(33)
Data <- data.frame(group = sample(1:3700000, 14000000, replace=TRUE),
person = sample(1:6800000, 14000000,replace = TRUE))
Edit 2
My real data is a bit more complex, with larger groups and more persons per group than the example data. Consequently it is more memory-intensive. I could not figure out how to simulate this kind of structure, so here is the real data for download:
Full person-group data
So, I managed to run this on your test data (I have 16 GB of RAM), but if you run this on your small example you will see that it does not give exactly the same results. I did not get why, but maybe you could help me with that. I will try to explain every step:
myFun <- function(dt) {
require(data.table)
# change the data do data.table:
setDT(dt)
# set key/order the data by group and person:
setkey(dt, group, person)
# I copy the initial data and rename the soon-to-be-merged column to "p2",
# which represents person2
dta <- copy(dt)
setnames(dta, "person", "p2")
# the first merge using data.table:
dt1 <- dt[dta, on = "group", allow.cartesian = TRUE, nomatch = 0]
# now we remove rows where persons are the same:
dt1 <- dt1[person != p2] # remove equal persons
# and also we need to remove rows where person1 and person2 are the same,
# just in different order , example:
# 2: a Tom Jerry
# 3: a Jerry Tom
# is the same. If I get it right, you did this using apply at the end of your code,
# but it is much better if we reduce the data now.
# My approach is also much faster (we put the pairwise min into the 2nd column
# and the pairwise max into the 3rd):
l1 <- pmin(dt1[[2]], dt1[[3]])
l2 <- pmax(dt1[[2]], dt1[[3]])
set(dt1, j = 2L, value = l1)
set(dt1, j = 3L, value = l2)
# now lets clear memory and take unique rows of dt1:
rm(l1, l2, dt)
dt1 <- unique(dt1)
gc()
# change name for group column:
setnames(dta, "group", "g2")
# second merge:
dt2 <- dt1[dta, on = "p2", allow.cartesian = TRUE, nomatch = 0]
rm(dt1)
gc()
setnames(dta, "p2", "p3")
dt3 <- dt2[dta, on = "g2", allow.cartesian = TRUE, nomatch = 0] # third merge
rm(dt2)
gc()
dt3 <- dt3[p3 != p2 & p3 != person] # removing equal persons
gc()
dt3 <- dt3[, .(person, p2, p3)]
gc()
return(dt3[])
}
On the small data set example:
df <- data.frame(group = c("a","a","b","b","b","c"),
person = c("Tom","Jerry","Tom","Anna","Sam","Nic"),
stringsAsFactors = FALSE)
df
myFun(df)
# person p2 p3
# 1: Anna Tom Jerry
# 2: Sam Tom Jerry
# 3: Jerry Tom Anna
# 4: Sam Tom Anna
# 5: Jerry Tom Sam
# 6: Anna Tom Sam
# 7: Anna Sam Tom
Something similar to your result, but not quite the same.
Now with larger data:
set.seed(33)
N <- 10e6
dt <- data.frame(group = sample(3.7e6, N, replace = TRUE),
person = sample(6.8e6, N, replace = TRUE))
system.time(results <- myFun(dt)) # 13.22 sec
rm(results)
gc()
And:
set.seed(33)
N <- 14e6
dt <- data.frame(group = sample(3.7e6, N, replace = TRUE),
person = sample(6.8e6, N, replace = TRUE))
system.time(results <- myFun(dt)) # around 40 sec, but RAM usage peaks near the maximum
Update:
Maybe you can try this splitting approach, let's say with nparts of 6-10:
myFunNew3 <- function(dt, nparts = 2) {
require(data.table)
setDT(dt)
setkey(dt, group, person)
dta <- copy(dt)
# split into N parts
splits <- rep(1:nparts, each = ceiling(dt[, .N]/nparts))
set(dt, j = "splits", value = splits)
dtl <- split(dt, by = "splits", keep.by = F)
set(dt, j = "splits", value = NULL)
rm(splits)
gc()
i = 1
for (i in seq_along(dtl)) {
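# process one chunk: repeat the cascading joins from myFun above for this
# chunk's rows only, then collect the deduplicated partial results in X1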
X <- copy(dtl[[i]])
setnames(dta, c("group", "person"))
X <- X[dta, on = "group", allow.cartesian = TRUE, nomatch = 0]
X <- X[person != i.person]
gc()
X <- X[dta, on = "person", allow.cartesian = TRUE, nomatch = 0]
gc()
setnames(dta, "group", "i.group")
X <- X[dta, on = "i.group", allow.cartesian = TRUE, nomatch = 0]
gc()
setnames(X, "i.person.1", "pers2")
setnames(X, "i.person", "pers1" )
setnames(X, "person", "person_in_common" )
X <- X[, .(pers1, pers2, person_in_common)]
gc()
X <- X[pers1 != pers2 & person_in_common != pers1 & person_in_common != pers2]
gc()
name1 <- "pers1"
name2 <- "pers2"
l1 <- pmin(X[[name1]], X[[name2]])
l2 <- pmax(X[[name1]], X[[name2]])
set(X, j = name1, value = l1)
set(X, j = name2, value = l2)
rm(l1, l2)
gc()
X <- unique(X)
gc()
if (i > 1) {
X1 <- rbindlist(list(X1, X), use.names = T, fill = T)
X1 <- unique(X1)
rm(X)
gc()
} else {
X1 <- copy(X)
}
dtl[[i]] <- 0L
gc()
}
rm(dta, dtl)
gc()
setkey(X1, pers1, pers2, person_in_common)
X1[]
}
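A usage sketch on the simulated data from Edit 1 (nparts here is just a starting guess to trade speed for peak memory):
set.seed(33)
dt <- data.frame(group = sample(1:3700000, 14000000, replace = TRUE),
                 person = sample(1:6800000, 14000000, replace = TRUE))
results <- myFunNew3(dt, nparts = 6)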
I want to create, from the dataset, a list that contains each word and its frequency. I did that and saved it into a variable named 'mylist'. Now I want to sort the list according to word frequency and create a barplot of the 10 words with the highest frequency.
But I have not succeeded in sorting it. I tried many ways to change the type of 'mylist' to a data.frame or data.table, but the frequency column still stays a list.
To sum up, I have the DT variable with 2 columns: 'x' contains the words and is of type character.
The second column is 'v', which contains the frequencies, and it is a list.
I am not succeeding in sorting it by frequency.
Please help me.
library(ggplot2)
library(MASS)
#get the data
data.uri = "http://www.crowdflower.com/wp-content/uploads/2016/03/gender-classifier-DFE-791531.csv"
pwd = getwd()
data.file.name = "gender.csv"
data.file = paste0(pwd, "/", data.file.name)
download.file(data.uri, data.file)
data = read.csv(data.file.name)
#manipulate the data
data <- data[data$X_unit_id < 815719694,]
print(data$X_unit_id)
#get all female fav_numbers
female_colors <- subset(data, data$gender=="female")
female_colors$fav_number
#get all male fav_numbers
male_colors <- subset(data, data$gender=="male")
male_colors$fav_number
text_male = subset(data, data$gender=="male")
text_male = text_male$text
print(text_male[1])
print(length(text_male))
v <- text_male[1:length(text_male)]
print(v)
print (v[1])
count_of_list = 0;
x = list()
for ( i in v) {
# Merge the two lists.
x <- c(x,unlist(strsplit(i," ")))
}
count = 0;
mylist = list()
for (word in x){
for (xWord in x){
if (word == xWord)
count = count + 1;
}
key <- word
value <- count
mylist[[ key ]] <- value
count = 0;
}
library(data.table)
DT = data.table(x=c(names(mylist)),v=c(mylist))
DT
As suggested in the comments, a reproducible example would be useful for creating an answer to help you. I will suggest a proposal anyway; try to adapt this procedure to your data.
Convert your list to a dataframe and use order:
df <- as.data.frame(your.data)
df <- data.frame(id = c("B", "A", "D", "C"), y = c(6, 8, 1, 5))
df
id y
1 B 6
2 A 8
3 D 1
4 C 5
df2 <- df[order(df$id), ]
df2
id y
2 A 8
1 B 6
4 C 5
3 D 1
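Applied to the OP's DT, where 'v' is a list column, the same idea might look like this (a sketch, assuming DT was built as in the question):
DT[, v := unlist(v)]   # flatten the list column into a plain numeric column
setorder(DT, -v)       # sort by frequency, highest first
head(DT, 10)           # the 10 most frequent words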
It looks like you're using a cumbersome way to calculate the word counts; something like this is faster and simpler:
library(dplyr)
foo <- c("ant", "ant", "bat", "dog","egg","ant","bat")
bar <- rnorm(7, 5, 2)
df <- data.frame(foo, bar)
group_by(df, foo) %>% summarise(n = n()) %>% arrange(desc(n))
foo n
(fctr) (int)
1 ant 3
2 bat 2
3 dog 1
4 egg 1
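Since the OP ultimately wants a barplot of the 10 most frequent words, here is a sketch building on that summary (the column name 'foo' comes from the toy example above, so adjust it to the real word column):
word_counts <- group_by(df, foo) %>% summarise(n = n()) %>% arrange(desc(n))
top10 <- head(word_counts, 10)
barplot(top10$n, names.arg = as.character(top10$foo), las = 2)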
I'm fumbling around with the following problem, but to no avail:
d <- data.frame(value = 1:4, row.names = c("abc", "abcd", "ef", "gh"))
value
abc 1
abcd 2
ef 3
gh 4
l <- nrow(d)
wordmat <- matrix(rep(NA, l^2), l, l, dimnames = list(row.names(d), row.names(d)))
for (i in 1:ncol(wordmat)) {
rid <- agrep(colnames(wordmat)[i], rownames(wordmat), max = 0)
d$matchid[i] <- paste(rid, collapse = ";")
}
# desired output:
(d_agg <- data.frame(value = c(3, 3, 4), row.names = c("abc;abcd", "ef", "gh")))
value
abc;abcd 3
ef 3
gh 4
is there a function for this?
Here's a possible solution that you might be able to modify to suit your needs.
Some notes:
I couldn't figure out how to deal with rownames() directly, particularly in the last stage, so this depends on you being happy with copying your row names as a new variable.
The function below "hard-codes" the variable names, functions, and so on. That is to say, it is not by any means a generalized function, but one which might be useful as you look further into this problem.
Here's the function.
matches <- function(data, ...) {
temp = vector("list", nrow(data))
for (i in 1:nrow(data)) {
temp1 = agrep(data$RowNames[i], data$RowNames, value = TRUE, ...)
temp[[i]] = data.frame(RowNames = paste(temp1, collapse = "; "),
value = sum(data[temp1, "value"]))
}
temp = do.call(rbind, temp)
temp[!duplicated(temp$RowNames), ]
}
Note that the function needs a column called RowNames, so we'll create that, and then test the function.
d <- data.frame(value = 1:4, row.names = c("abc", "abcd", "ef", "gh"))
d$RowNames <- rownames(d)
matches(d)
# RowNames value
# 1 abc; abcd 3
# 3 ef 3
# 4 gh 4
matches(d, max.distance = 2)
# RowNames value
# 1 abc; abcd 3
# 3 abc; abcd; ef; gh 10
matches(d, max.distance = 4)
# RowNames value
# 1 abc; abcd; ef; gh 10
This works for your example but may need tweaking for the real thing:
d <- data.frame(value = 1:4, row.names = c("abc", "abcd", "ef", "gh"))
rowclust <- hclust(as.dist(adist(rownames(d))), method="single")
rowgroups <- cutree(rowclust, h=1.5)
rowagg <- aggregate(d, list(rowgroups), sum)
rowname <- unclass(by(rownames(d), rowgroups, paste, collapse=";"))
rownames(rowagg) <- rowname
rowagg
Group.1 value
abc;abcd 1 3
ef 2 3
gh 3 4
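To match the desired output exactly, the helper Group.1 column can simply be dropped afterwards:
rowagg$Group.1 <- NULL
rowagg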