How to rearrange/tidy columns in an R data frame?

Suppose we have this data set, where avg_1, avg_2 and avg_3 repeat themselves:
avg_1 avg_2 avg_3 party_gender
424 242 213 RM
424 242 213 RF
424 242 213 DM
How can I edit this using R so that the data set looks like this (where the avg values aren't repeated, and avg_1, avg_2 and avg_3 correspond to RM, RF and DM respectively):
avg party_gender
424 RM
242 RF
213 DM

Admittedly, this is a bit hacky and doesn't work nicely if you have more than just a few conditions for the avg. value:
library(tidyverse)
dat %>%
  pivot_longer(-party_gender) %>%
  filter(party_gender == "RM" & value == 424 |
         party_gender == "RF" & value == 242 |
         party_gender == "DM" & value == 213) %>%
  mutate(name = "avg") %>%
  pivot_wider()
which gives:
# A tibble: 3 x 2
party_gender avg
<chr> <dbl>
1 RM 424
2 RF 242
3 DM 213
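If the duplicated rows really are identical and the i-th avg column corresponds to the i-th row's party_gender, a simpler sketch (assuming a data frame named `dat` shaped like the example above) is to take the values from the first row and pair them by position:

```r
library(tibble)

# assuming every row repeats the same avg values, keep only the first row's
# values and pair them with party_gender by position
tidy_dat <- tibble(
  avg          = unlist(dat[1, c("avg_1", "avg_2", "avg_3")], use.names = FALSE),
  party_gender = dat$party_gender
)
```

This avoids hard-coding one filter condition per avg value, but it stands or falls with the positional assumption.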


r count instances where variables x and y are equal and place in table

I have the following code
length(which(tor$TorL == 1 & tor$SID == 351))
length(which(tor$TorL == 1 & tor$SID == 352))
## The result of this is as follows
[1] 3843
[1] 223
The lines of code give me the count of TorL when SID==xxx.
TorL is a binary variable of a low value
SID goes from 351 to 358, I am only showing 351.
My second code query is
length(which(tor$TorH==1 & tor$SID==351))
length(which(tor$TorH==1 & tor$SID==352))
## Results from above
[1] 155
[1] 96
TorH is a binary variable of a high value
I would like to able to do this count as above and place the results in a table, something like as follows, as I would like to do a correlation on the results.
SID TorL TorH
351 3843 155
352 223 96
Thanks
With tidyverse:
df <- data.frame(SID = sample(c(351, 352, 353), 30, replace = TRUE),
                 TorL = sample(c(0, 1), 30, replace = TRUE),
                 TorH = sample(c(0, 1), 30, replace = TRUE))
df %>% group_by(SID) %>% summarise_at(vars(TorL, TorH), sum) %>% ungroup()
# A tibble: 3 × 3
    SID  TorL  TorH
  <dbl> <dbl> <dbl>
1   351     6     8
2   352     3     6
3   353     6     6
I got it working after playing around a little with asafpr's answer:
torlh <- df %>%
  group_by(SID) %>%
  summarise(ltor = sum(TorL), htor = sum(TorH))
torlh
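To follow up on the stated goal of running a correlation on the results, a minimal sketch using the `torlh` summary table built above (column names `ltor`/`htor` as defined there):

```r
# Pearson correlation between the low- and high-value counts across SIDs
cor(torlh$ltor, torlh$htor)
```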

Merge rows depending on 2 column values

region. age. pop
SSC21184 0 209
SSC21184 1 195
SSC21184 2 242
SSC21184 3 248
SSC21185 0 231
SSC21185 1 287
SSC21185 2 268
SSC21185 3 257
I'm looking to:
group age groups (column 2) for ages <2 and >=2,
find the population for these age groups, for each region
so it should look something like this:
region. age_group. pop
SSC21184 <2 404
SSC21184 >=2 490
SSC21185 <2 518
SSC21185 >=2 525
I've attempted using tapply(df$pop, df$agegroup, FUN = mean) %>% as.data.frame(), however I continue to get the error: arguments must have same length
Edit: If possible, how would I be able to plot the population per age group per region? As for example, a stacked bar graph?
Thank you!
If you have only two age groups we can use ifelse:
library(dplyr)
df %>%
  group_by(region, age = ifelse(age >= 2, '>=2', '<2')) %>%
  summarise(sum = sum(pop))
# region     age    sum
#  <chr>    <chr> <int>
#1 SSC21184 <2      404
#2 SSC21184 >=2     490
#3 SSC21185 <2      518
#4 SSC21185 >=2     525
A more general solution would be with cut if you have large number of age groups.
df %>%
  group_by(region, age = cut(age, breaks = c(-Inf, 1, Inf),
                             labels = c('< 2', '>=2'))) %>%
  summarise(sum = sum(pop))
We can use the same logic in tapply as well.
with(df, tapply(pop, list(region, ifelse(age >=2, '>=2', '<2')), sum))
# <2 >=2
#SSC21184 404 490
#SSC21185 518 525
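Regarding the edit about plotting: a sketch of a stacked bar chart of population per age group per region, assuming ggplot2 and dplyr (>= 1.0, for `.groups`) are available; `age_group` is a name introduced here for illustration:

```r
library(dplyr)
library(ggplot2)

df %>%
  group_by(region, age_group = ifelse(age >= 2, ">=2", "<2")) %>%
  summarise(pop = sum(pop), .groups = "drop") %>%
  ggplot(aes(x = region, y = pop, fill = age_group)) +
  geom_col()  # geom_col stacks bars by default
```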

Filtering rows across multiple columns in R

I have a tibble with eleven columns and I would like to filter out in ten columns (PC1:PC10) the rows that are not equal to 145.
I have tried to solve this with a for loop. However, this does not work. Is there another option or can somebody explain to me where my error is? I also tried with lapply but I do not know how to integrate the filter function. Thank you very much.
install.packages("tidyverse")
library(tidyverse)
set.seed(120)
data.matrix <- matrix(nrow=100, ncol=10)
colnames(data.matrix) <- c(
paste("PC", 1:10, sep=""))
rownames(data.matrix) <- paste("food", 1:100, sep="")
for (i in 1:100) {
wt.values <- rpois(10, lambda=sample(x=10:1000, size=1))
data.matrix[i,] <- c(wt.values)
}
head(data.matrix)
#> PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
#> food1 145 150 136 147 134 158 152 141 152 115
#> food2 629 615 592 636 617 595 618 602 621 626
#> food3 343 355 401 378 361 393 365 374 352 371
#> food4 420 433 417 394 431 430 458 453 404 459
#> food5 866 850 885 826 845 781 838 835 850 857
#> food6 10 7 7 11 7 4 8 11 9 12
dim(data.matrix)
#> [1] 100 10
data <- data.matrix %>% data.frame() %>% rownames_to_column(var = "food_groups") %>% as_tibble()
# Normally I would do:
data %>% filter(!PC1 == 145 ) %>% select(PC1)
data %>% filter(!PC2 == 145 ) %>% select(PC2)
data %>% filter(!PC3 == 145 ) %>% select(PC3)
# However, I would like to avoid repetition by looping (or lapply...)
# I tried this and it does not work:
fltr <- function(y) {
  f <- filter(!y == 145)
  f
}
loadings_final <- function(x) {
  nc <- ncol(x)
  filters <- numeric(nc)
  for (i in 1:nc) {
    filters[i] <- fltr(x[,i])
  }
  filters
}
loadings_final(data)
#> Error in UseMethod("filter_"): no applicable method for 'filter_' applied to an object of class "c('matrix', 'logical')"
Created on 2020-05-07 by the reprex package (v0.3.0)
You can get list of values using lapply :
list_output <- lapply(data[-1], function(x) data.frame(col = x[x != 145]))
This can also be done with map
list_output <- purrr::map(data[-1], ~tibble(col = .x[.x != 145]))
library(reshape2)
data %>%
  melt(., id.vars = "food_groups",
       measure.vars = c('PC1','PC2','PC3','PC4','PC5','PC6','PC7','PC8','PC9','PC10')) %>%
  filter(value != 145)
returns:
food_groups variable value
1 food2 PC1 92
2 food3 PC1 801
3 food4 PC1 398
4 food5 PC1 238
5 food6 PC1 213
6 food7 PC1 281
7 food8 PC1 1031 ....
You notice that PC1-food1 combination is filtered out.
You can then split this into a list of tibbles:
library(reshape2)
data %>%
  melt(., id.vars = "food_groups",
       measure.vars = c('PC1','PC2','PC3','PC4','PC5','PC6','PC7','PC8','PC9','PC10')) %>%
  filter(value != 145) %>%
  group_split(variable) -> mylist
After that:
# name list elements
names(mylist) <- c('PC1','PC2','PC3','PC4','PC5','PC6','PC7','PC8','PC9','PC10')
# assign to global environment
list2env(mylist,globalenv())
# now you have:
> ls()
[1] "data" "data.matrix" "i" "mylist" "PC1"
[6] "PC10" "PC2" "PC3" "PC4" "PC5"
[11] "PC6" "PC7" "PC8" "PC9" "wt.values"
Edit: @Ronak Shah's answer provides a one-liner approach for generating a list of tibbles split according to PC. After executing his one-liner, you only need to call list2env() to get the desired output. If one likes brevity, his answer is preferred.

str_match based on vector with count issue

I haven't got a reprex, but my data are stored in a CSV file:
https://transcode.geo.data.gouv.fr/services/5e2a1fbefa4268bc25628f27/feature-types/drac:site?format=CSV&projection=WGS84
library(readr)
bzh_sites <- read_csv("site.csv")
I want to count rows based on character matching (column NATURE):
pattern<-c("allée|aqueduc|architecture|atelier|bas|carrière|caveau|chapelle|château|chemin|cimetière|coffre|dépôt|dolmen|eau|église|enceinte|enclos|éperon|espace|exploitation|fanum|ferme|funéraire|groupe|habitat|maison|manoir|menhir|monastère|motte|nécropole|occupation|organisation|parcellaire|pêcherie|prieuré|production|rue|sépulture|stèle|thermes|traitement|tumulus|villa")
test2 <- bzh_sites %>%
  drop_na(NATURE) %>%
  group_by(NATURE = str_match(NATURE, pattern)) %>%
  summarise(n = n())
gives me :
NATURE n
1 allée 176
2 aqueduc 73
3 architecture 68
4 atelier 200
AND another test with the same data (NATURE)
pattern <- c("allée|aqueduc|architecture|atelier")
test2 <- bzh_sites %>%
  drop_na(NATURE) %>%
  group_by(NATURE = str_match(NATURE, pattern)) %>%
  summarise(n = n())
gives me :
NATURE n
1 allée 178
2 aqueduc 74
3 architecture 79
4 atelier 248
I have no idea why the counts differ.
I tried to find out where the discrepancy is for the first group, i.e. "allée". This is what I found:
library(stringr)
pattern1<-c("allée|aqueduc|architecture|atelier|bas|carrière|caveau|chapelle|château|chemin|cimetière|coffre|dépôt|dolmen|eau|église|enceinte|enclos|éperon|espace|exploitation|fanum|ferme|funéraire|groupe|habitat|maison|manoir|menhir|monastère|motte|nécropole|occupation|organisation|parcellaire|pêcherie|prieuré|production|rue|sépulture|stèle|thermes|traitement|tumulus|villa")
#Get indices where 'allée' is found using pattern1
ind1 <- which(str_match(bzh_sites$NATURE, pattern1 )[, 1] == 'allée')
pattern2 <- c("allée|aqueduc|architecture|atelier")
#Get indices where 'allée' is found using pattern2
ind2 <- which(str_match(bzh_sites$NATURE, pattern2)[, 1] == 'allée')
#Indices which are present in ind2 but absent in ind1
setdiff(ind2, ind1)
#[1] 3093 10400
#Get corresponding text
temp <- bzh_sites$NATURE[setdiff(ind2, ind1)]
temp
#[1] "dolmen allée couverte" "coffre funéraire allée couverte"
What happens when we use pattern1 and pattern2 on temp
str_match(temp, pattern1)
# [,1]
#[1,] "dolmen"
#[2,] "coffre"
str_match(temp, pattern2)
# [,1]
#[1,] "allée"
#[2,] "allée"
As we can see, with pattern1 certain values are classified into another group because a different keyword occurs earlier in the string than "allée", hence the mismatch.
A similar explanation can be given for mismatches in other groups.
str_match only returns the first match; to get all the matches of the pattern we can use str_match_all:
table(unlist(str_match_all(bzh_sites$NATURE, pattern1)))
# allée aqueduc architecture atelier bas
# 178 76 79 252 62
# carrière caveau chapelle château chemin
# 46 35 226 205 350
# cimetière coffre dépôt dolmen eau
# 275 155 450 542 114
# église enceinte enclos éperon espace
# 360 655 338 114 102
#exploitation fanum ferme funéraire groupe
# 1856 38 196 1256 295
# habitat maison manoir menhir monastère
# 1154 65 161 1036 31
# motte nécropole occupation organisation parcellaire
# 566 312 5152 50 492
# pêcherie prieuré production rue sépulture
# 69 66 334 44 152
# stèle thermes traitement tumulus villa
# 651 50 119 1232 225

R summing row one with all rows

I am trying to analyse website data for AB testing.
My reference point is based on experimentName = Experiment 1 (control version)
experimentName UniquePageView UniqueFrequency NonUniqueFrequency
1 Experiment 1 459 294 359
2 Experiment 2 440 286 338
3 Experiment 3 428 273 348
What I need to do is sum every UniquePageView, UniqueFrequency and NonUniqueFrequency row when experimentName = Experiment 1
e.g.
UniquePageView WHERE experimentName = 'Experiment 1 ' + UniquePageView WHERE experimentName = 'Experiment 2 ',
UniquePageView WHERE experimentName = 'Experiment 1 ' + UniquePageView WHERE experimentName = 'Experiment 3 '
and so on (I could have an unlimited number of experiments),
then do the same for UniqueFrequency and NonUniqueFrequency (I could have an unlimited number of columns as well).
Result expected:
experimentName UniquePageView UniqueFrequency NonUniqueFrequency Conversion Rate Pooled UniquePageView Conversion Rate Pooled UniqueFrequency Conversion Rate Pooled NonUniqueFrequency
1 Experiment 1 459 294 359 918 588 718
2 Experiment 2 440 286 338 899 580 697
3 Experiment 3 428 273 348 887 567 707
here is the math behind it:
experimentName UniquePageView UniqueFrequency NonUniqueFrequency Conversion Rate Pooled UniquePageView Conversion Rate Pooled UniqueFrequency Conversion Rate Pooled NonUniqueFrequency
1 Experiment 1 459 294 359 459 + 459 294 + 294 359 + 359
2 Experiment 2 440 286 338 459 + 440 294 + 286 359 + 338
3 Experiment 3 428 273 348 459 + 428 294 + 273 359 + 348
In base R, you can do this in one line by column-binding (with cbind) the initial data frame to the sum of the initial data frame and a version of it consisting entirely of duplicates of the "Experiment 1" row.
cbind(dat, dat[,-1] + dat[rep(which(dat$experimentName == "Experiment 1"), nrow(dat)), -1])
# experimentName UniquePageView UniqueFrequency NonUniqueFrequency UniquePageView UniqueFrequency
# 1 Experiment 1 459 294 359 918 588
# 2 Experiment 2 440 286 338 899 580
# 3 Experiment 3 428 273 348 887 567
# NonUniqueFrequency
# 1 718
# 2 697
# 3 707
To update the column names at the end (assuming you stored the resulting data frame in res), you could use:
names(res)[4:6] <- c("CombinedPageView", "CombinedUniqueFrequency", "CombinedNonUniqueFrequency")
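For a tidyverse-style alternative to the cbind approach, a sketch assuming dplyr >= 1.0 (for across() and .names); the `Combined` prefix mirrors the renaming step above:

```r
library(dplyr)

# add the Experiment 1 row's value to every row, column by column
res <- dat %>%
  mutate(across(UniquePageView:NonUniqueFrequency,
                ~ .x + .x[experimentName == "Experiment 1"],
                .names = "Combined{.col}"))
```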
Do you know how to use dplyr? If you're new to R, this is a pretty good lesson to learn. dplyr includes the functions filter and summarise, which are all you need for this problem - very simple!
First, take your data frame
df
Then, filter to only the data you want, in this case when experimentName = Experiment 1
df
df <- filter(df, experimentName == "Experiment 1")
Now, summarise to find the sums of UniquePageView, UniqueFrequency and NonUniqueFrequency
df
df <- filter(df, experimentName == "Experiment 1")
summarise(df, SumUniquePageView = sum(UniquePageView),
SumUniqueFrequency = sum(UniqueFrequency),
SumNonUniqueFrequency = sum(NonUniqueFrequency))
This will return a small table with the answers you're looking for. For a slightly more concise way to do this, you can use the pipe operator %>% from the magrittr package. It passes the object from the previous statement as the first argument of the next statement, as follows:
df %>% filter(experimentName == "Experiment 1") %>% summarise(SumUniquePageView = sum(UniquePageView), etc)
If you don't yet have those packages, you can get them with install.packages("dplyr") and load them with library(dplyr).
