merging tables into one data frame

merging tables into one data frame - r

I have several columns in my source data frame containing the same three possible variables (1, 2 and 3) over several hundred rows. I'm using the table function to summarize each column as shown here
column1 <- table(data$column1)
column2 <- table(data$column2)
column3 <- table(data$column3)
...
These tables print out results of the form below
1 2 3
6 74 300
I'm trying to combine all of these tables into one data frame of this form
1
2
3
column1
6
74
300
column2
2
87
298
column3
4
57
489
How do I make this happen? Thank you!

We can use the tidyverse, suppose your data is called dat:
library(tidyverse)
dat %>%
pivot_longer(cols = everything()) %>%
count(name, value) %>%
pivot_wider(names_from = value, values_from = n)
# name `1` `2` `3`
# 1 column1 6 74 300
# 2 column2 2 87 298
# 3 column3 4 57 489

Got a solution using rbind from reddit. It does exactly what I was looking for.
#selected all tables in the environment
tables = sapply(.GlobalEnv, is.table)
#rbinded them
allquestions <- do.call(rbind, mget(names(tables)[tables]))

Related

Rebuild tibble under condition

My Tibble:
df1 <- tibble(a = c("123*", "123", "124", "678*", "678", "679", "677"))
# A tibble: 7 x 1
a
<chr>
1 123*
2 123
3 124
4 678*
5 678
6 679
7 677
What it should become:
# A tibble: 3 x 2
a b
<chr> <chr>
1 123 124
2 678 679
3 678 677
The values with the stars refer to the following values with no stars, until a new value with a star comes and so on.
Each value with a star should go to the first column, the other values (except the ones that are identical to the values with a star, except the star) should go to the second column. If one value with a star is followed by several values, they should still be linked to eachother, so the values in the first column are duplicated to keep the connection.
I know how to filter and bring the values in each column, but not sure how i would keep the connection.
Regards

We can use tidyverse. Create a grouping column based on the occurence of * in 'a', extract the numeric part with parse_number, get the distinct rows, grouped by 'grp', create a new column with the first value of 'b'
library(dplyr)
library(stringr)
df1 %>%
transmute(grp = cumsum(str_detect(a, fixed("*"))),
b = readr::parse_number(a)) %>%
distinct(b, .keep_all = TRUE) %>%
group_by(grp) %>%
mutate(a = first(b)) %>%
slice(-1) %>%
ungroup %>%
select(a, b)
-output
# A tibble: 3 × 2
a b
<dbl> <dbl>
1 123 124
2 678 679
3 678 677

Here is one base R option -
Using cumsum and grepl we split the data on occurrence of *.
In each group, we drop the values which are similar to the star values and create a dataframe with two columns.
Finally, combine the list of dataframes in one combined dataframe.
result <- do.call(rbind, lapply(split(df1,
cumsum(grepl('*', df1$a, fixed = TRUE))), function(x) {
a <- x[[1]]
a[1] <- sub('*', '', a[1], fixed = TRUE)
data.frame(a = a[1], b = a[a != a[1]])
}))
rownames(result) <- NULL
result
# a b
#1 123 124
#2 678 679
#3 678 677

Keeping one row and discarding others in R using specific criteria?

I'm working with the data frame below, which is just part of the full data, and I need to condense the duplicate numbers in the id column into one row. I want to preserve the row that has the highest sbp number, unless it's 300 or over, in which case I want to discard that too.
So for example, for the first three rows that have id as 13480, I want to keep the row that has 124 and discard the other two.
id,sex,visits,sbp
13480,M,2,124
13480,M,3,306
13480,M,4,116
13520,M,2,124
13520,M,3,116
13520,M,4,120
13580,M,2,NA
13580,M,3,124
This is the farthest I got, been trying to tweak this but not sure I'm on the right track:
maxsbp <- split(sbp, sbp$sbp)
r <- data.frame()
for (i in 1:length(maxsbp)){
one <- maxsbp[[i]]
index <- which(one$sbp == max(one$sbp))
select <- one[index,]
r <- rbind(r, select)
}
r1 <- r[!(sbp$sbp>=300),]
r1

I think a tidy solution to this would work quite well. I would first filter all values above 300, if you do not want to keep any value above that threshold. Then group_by id, order, and keep the first.
my.df <- data.frame("id" = c(13480,13480,13480,13520,13520,13520,13580,13580),
"sex" = c("M","M","M","M","M","M","M","M"),
"sbp"= c(124,306,116,124,116,120,NA,124))
my.df %>% filter(sbp < 300) # filter to retain only values below 300
%>% group_by(id) # group by id
%>% arrange(-sbp) # arrange by id in descending order
%>% top_n(1, sbp) # retain first value i.e. the largest
# A tibble: 3 x 3
# Groups: id [3]
# id sex sbp
# <dbl> <chr> <dbl>
#1 13480 M 124
#2 13520 M 124
#3 13580 M 124

In R, very rarely you'll require explicit for loops to do tasks.
There are functions available which will help you perform such grouped operations.
For example, in base R you can use subset and ave :
subset(df,sbp == ave(sbp,id,FUN = function(x) max(sbp[sbp <= 300],na.rm = TRUE)))
# id sex visits sbp
#1 13480 M 2 124
#4 13520 M 2 124
#8 13580 M 3 124
The same can be done using dplyr whose syntax is a little bit easier to understand.
library(dplyr)
df %>%
group_by(id) %>%
filter(sbp == max(sbp[sbp <= 300], na.rm = TRUE))

slice_head can also be used
my.df <- data.frame("id" = c(13480,13480,13480,13520,13520,13520,13580,13580),
"sex" = c("M","M","M","M","M","M","M","M"),
"sbp"= c(124,306,116,124,116,120,NA,124))
> my.df
id sex sbp
1 13480 M 124
2 13480 M 306
3 13480 M 116
4 13520 M 124
5 13520 M 116
6 13520 M 120
7 13580 M NA
8 13580 M 124
Proceed simply like this
my.df %>% group_by(id, sex) %>%
arrange(desc(sbp)) %>%
slice_head() %>%
filter(sbp <300)
# A tibble: 2 x 3
# Groups: id, sex [2]
id sex sbp
<dbl> <chr> <dbl>
1 13520 M 124
2 13580 M 124

Select variables with different names

I have some DF’s with different variable names, but they have the same content. Unfortunately, my files have no pattern, but I am now trying to standardize them. For example, I have these 4 DF’s and I would like to select only one variable:
KEY_WIN <- c(123,456,789)
COUNTRY <- c("USA","FRANCE","MEXICO")
DF1 <- data.frame(KEY_WIN,COUNTRY)
KEY_WINN <- c(12,55,889)
FOOD <- c("RICE","TOMATO","MANGO")
CAR <- c("BMW","FERRARI","TOYOTA")
DF2 <- data.frame(KEY_WINN,FOOD,CAR)
ID <- c(555,698,33)
CITY <- c("NYC","LONDON","PARIS")
DF3 <- data.frame(ID,CITY)
NUMBER <- c(3,436,1000)
OCEAN <- c("PACIFIC","ATLANTIC","INDIAN")
DF4 <- data.frame(NUMBER,OCEAN)
I would like to create a routine to select only the variables KEY_WIN, KEY_WINN, ID, NUMBER. My expected result would be:
DF_FINAL<- data.frame(KEY=c(123,456,789, 12,55,889, 555,698,33, 3,436,1000))
How would I select only those variables?

There are multiple ways I would imagine you could approach this.
First, you could put your data frames in a list:
listofDF <- list(DF1, DF2, DF3, DF4)
Then, you could bind_rows to add the data frames together, and then coalesce to merge into one column.
library(tidyverse)
bind_rows(listofDF) %>%
mutate(KEY = coalesce(KEY_WIN, KEY_WINN, ID, NUMBER)) %>%
select(KEY)
KEY
1 123
2 456
3 789
4 12
5 55
6 889
7 555
8 698
9 33
10 3
11 436
12 1000
If you knew that the first column was always your KEY column, you could simply do:
KEY = unlist(lapply(listofDF, "[[", 1))
This would extract the first column from all of your data frames:
[1] 123 456 789 12 55 889 555 698 33 3 436 1000

R Frequency table for multiselect survey question across several columns

I want to do a fairly common analysis of survey questions in R, but am stuck in the middle.
Imagine a survey where you are asked to answer which brands do you associate with certain features (e.g. "brands" could be PlayStation, XBox..., and features could be "speed", "graphics"... where each brand can be checked on several features aka mulit-select). E.g. sth. like this here: https://www.harvestyourdata.com/fileadmin/images/question-type-screenshots/Grid-multi-select.jpg
You often refer to these questions as multi-select grid or matrix questions.
Anyway, from a data perspective, this kind of data is usually stored in wide format where each row*column combination is one variable, which is 0/1 coded (0 if the survey participant doesn't check the box, 1 otherwise).
Assuming we have 5 brands and 10 items, we would have 50 variables in total, ideally following a nice, structured naming scheme, e.g. item1_column1, item2_column1, item3_column1, [...], item1_column2 and so on.
Now, I want to analyze (frequency table) all of these variables in one iteration. I've already found the cross.multi.table function in the questionr package. However, it only allows to analyze all items based on on single factor. What I need instead is to allow for several columns at the same time.
Any ideas? MIght be I'm missing a function from another package or this can easily be done with tidyverse or even with the cross.multi.table function?
Using this data as test input:
dat = data.frame(item1_column1 = c(0,1,1,1),
item2_column1 = c(1,1,1,0),
item3_column1 = c(0,0,1,1),
item1_column2 = c(1,1,1,0),
item2_column2 = c(0,1,1,1),
item3_column2 = c(1,0,1,1),
item1_column3 = c(0,1,1,0),
item2_column3 = c(1,1,1,1),
item3_column3 = c(0,0,1,0))
I'd expect this output:
column1 column2 column3
item1 3 3 2
item2 3 3 4
item3 2 3 1
or ideally as proportions/percentages:
column1 column2 column3
item1 75% 75% 50%
item2 75% 75% 100%
item3 50% 75% 25%

One way could be to get data into long format using gather, separate columns based on _, group_by item and column and calculate the ratio of value column and spread the data to wide format.
library(dplyr)
library(tidyr)
dat %>%
gather(key, value) %>%
separate(key, into = c("item", "column"), sep = "_") %>%
group_by(item, column) %>%
summarise(prop = mean(value) * 100) %>%
spread(column, prop)
# item column1 column2 column3
# <chr> <dbl> <dbl> <dbl>
#1 item1 75 75 50
#2 item2 75 75 100
#3 item3 50 75 25
A bit shorter (Thanks to #M-M)
dat %>%
summarise_all(~mean(.) * 100) %>%
gather(key, value) %>%
separate(key, into = c("item", "column"), sep = "_") %>%
spread(column, value)

What I do here, by using data.table package, is summarizing each column, converting data to long format, breaking a column to two (item and column), and finally converting to wide format. Look below;
library(data.table)
dcast(setDT(melt(setDT(dat)[,100*colMeans(.SD),]),keep.rownames = T)[,
c("item", "column") := tstrsplit(rn, "_", fixed=TRUE)],
item ~ column, value.var = "value")
#> item column1 column2 column3
#> 1: item1 75 75 50
#> 2: item2 75 75 100
#> 3: item3 50 75 25

We can do this in base R, by creating a two column data.frame with the column names replicated, cbind with the unlisted values, and use xtabs to get the sum while pivoting to 'wide' format
out <- xtabs(val ~ ., cbind(read.table(text = names(dat)[col(dat)],
sep="_", header = FALSE), val = unlist(dat, use.names = FALSE)))
out
# V2
#V1 column1 column2 column3
# item1 3 3 2
# item2 3 3 4
# item3 2 3 1
Or as #GKi mentioned (a compact version would be) to split the column names by _, create a data.frame with that along with colSums (or colMeans - for percentage) and use xtabs for pivoting
xtabs(n ~ ., data.frame(do.call("rbind",
strsplit(colnames(dat), "_")), n=colSums(dat)))
Or to get the percentage
xtabs(val ~ ., aggregate(val ~ ., cbind(read.table(text = names(dat)[col(dat)],
sep="_", header = FALSE), val = unlist(dat, use.names = FALSE)), mean)) * 100
# V2
#V1 column1 column2 column3
# item1 75 75 50
# item2 75 75 100
# item3 50 75 25
Or inspired from #GKi, using enframe
library(dplyr)
library(tidyr)
library(tibble)
enframe(colSums(dat)) %>%
separate(name, into = c('name1', 'name2')) %>%
spread(name2, value)
# A tibble: 3 x 4
# name1 column1 column2 column3
# <chr> <dbl> <dbl> <dbl>
#1 item1 3 3 2
#2 item2 3 3 4
#3 item3 2 3 1
To get the percentage, just change the first line of code to
enframe(100 *colMeans(dat))

group by in R dplyr for more than one variable on unique value of other variable

I have a dataset with three columns as below:
data <- data.frame(
grpA = c(1,1,1,1,1,2,2,2),
idB = c(1,1,2,2,3,4,5,6),
valueC = c(10,10,20,20,10,30,40,50),
otherD = c(1,2,3,4,5,6,7,8)
)
valueC is unique to each unique value of idB.
I want to use dplyr pipe (as the rest of my code is in dplyr) and use group_by on grpA to get a new column with sum of valueC values for each group.
The answer should be like:
newCol <- c(40,40,40,40,40,120,120,120)
but with data %>% group_by(grpA) %>%
mutate(newCol=sum(valueC), I get newCol <- c(70,70,70,70,70,120,120,120)
How do I include unique value of idB? Is there anything else I can use instead of group_by in dplyr %>% pipe.
I cant use summarise as I need to keep values in otherD intact for later use.
Other option I have is to create newCol separately through sql and then merge with left join. But I am looking for a better solution inline.
If it has been answered before, please refer me to the link as I could not find any relevant answer to this issue.

We need unique with match
data %>%
group_by(grpA) %>%
mutate(ind = sum(valueC[match(unique(idB), idB)]))
# A tibble: 8 x 5
# Groups: grpA [2]
# grpA idB valueC otherD ind
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 1 10 1 40
#2 1 1 10 2 40
#3 1 2 20 3 40
#4 1 2 20 4 40
#5 1 3 10 5 40
#6 2 4 30 6 120
#7 2 5 40 7 120
#8 2 6 50 8 120
Or another option is to get the distinct rows by 'grpA', 'idB', grouped by 'grpA', get the sum of 'valueC' and left_join with the original data
data %>%
distinct(grpA, idB, .keep_all = TRUE) %>%
group_by(grpA) %>%
summarise(newCol = sum(valueC)) %>%
left_join(data, ., by = 'grpA')

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

merging tables into one data frame - r

We can use the tidyverse, suppose your data is called dat: library(tidyverse) dat %>% pivot_longer(cols = everything()) %>% count(name, value) %>% pivot_wider(names_from = value, values_from = n) # name `1` `2` `3` # 1 column1 6 74 300 # 2 column2 2 87 298 # 3 column3 4 57 489

Got a solution using rbind from reddit. It does exactly what I was looking for. #selected all tables in the environment tables = sapply(.GlobalEnv, is.table) #rbinded them allquestions <- do.call(rbind, mget(names(tables)[tables]))

Related

Rebuild tibble under condition

Keeping one row and discarding others in R using specific criteria?

Select variables with different names

R Frequency table for multiselect survey question across several columns

group by in R dplyr for more than one variable on unique value of other variable

Categories

Resources