Count frequency of same value in several columns - r

I'm quite new to R and I'm facing a problem which I guess is quite easy to fix but I couldn't find the answer.
I have a data frame called clg with three columns: date, X1, and X2.
X1 and X2 hold the names of country teams, and both draw on the same list of countries.
I'm simply trying to count the frequency of each country in the two columns as a total.
So far, I've only been able to count the frequency of the X1 column but I didn't find a way to sum both columns.
clt <- as_tibble(na.omit(count(clg, clg$X1)))
I would like to get a data frame with the unique countries in the first column and the sum of their occurrences in X1 and X2 in the second column.

You can use unlist() and table() to get the overall counts. Wrapping the result in data.frame() gives you the desired two-column output.
clg <- data.frame(date=1:3,
X1=c("nor", "swe", "alg"),
X2=c("swe", "alg", "jpn"))
data.frame(table(unlist(clg[c("X1", "X2")])))
# Var1 Freq
# 1 alg 2
# 2 nor 1
# 3 swe 2
# 4 jpn 1

With tidyverse, we can gather into 'long' format and then do the count
library(tidyverse)
gather(clg, key, Var1, -date) %>%
count(Var1)
# A tibble: 4 x 2
# Var1 n
# <chr> <int>
#1 alg 2
#2 jpn 1
#3 nor 1
#4 swe 2
data
clg <- structure(list(date = 1:3, X1 = structure(c(2L, 3L, 1L), .Label = c("alg",
"nor", "swe"), class = "factor"), X2 = structure(c(3L, 1L, 2L
), .Label = c("alg", "jpn", "swe"), class = "factor")),
class = "data.frame", row.names = c(NA,
-3L))
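In current tidyr, gather() is superseded by pivot_longer(). Here is a minimal sketch of the same reshape-then-count idea, using the small example data from the first answer. The names "column" and "country" are my own choice, not something from the question.

```r
library(dplyr)
library(tidyr)

# Example data from the first answer; stringsAsFactors = FALSE keeps the
# country names as plain character vectors on older R versions too.
clg <- data.frame(date = 1:3,
                  X1 = c("nor", "swe", "alg"),
                  X2 = c("swe", "alg", "jpn"),
                  stringsAsFactors = FALSE)

counts <- clg %>%
  # stack X1 and X2 into a single "country" column
  pivot_longer(c(X1, X2), names_to = "column", values_to = "country") %>%
  # one row per country, with the number of occurrences in column n
  count(country)

counts
```

The result has one row per country and a column n holding the combined count from both columns.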

You can reach your goal in two steps. First, count the occurrences of each country in X1 and in X2 separately. Then join the two summaries on the country name and add the counts.
library(dplyr)
X1_sum <- clg %>%
  dplyr::group_by(country = as.character(X1)) %>%
  dplyr::summarize(n_x1 = n())
X2_sum <- clg %>%
  dplyr::group_by(country = as.character(X2)) %>%
  dplyr::summarize(n_x2 = n())
final_summary <- X1_sum %>%
  # a full join keeps countries that appear in only one of the two columns
  dplyr::full_join(X2_sum, by = "country") %>%
  # a country missing from one column has NA there, so count it as 0
  dplyr::mutate(n_sum = coalesce(n_x1, 0L) + coalesce(n_x2, 0L))

Related

Assign consecutive trial numbers to data frame beginning at match

I want to assign consecutive trial numbers (1-16) to a long dataframe depending on certain values of other variables.
It should look like this (simplified):
value trial_no
videoA 1
other 1
videoB 2
other 2
other 2
videoC 3
...
This basically does what I want, but it just assigns the row numbers.
df2 <- df1 %>%
mutate(trial_no = case_when(grepl('video', value) ~ row_number())) %>%
fill(trial_no)
I thought this might do what I want, but it assigns 16 to every row.
for (vid in c(1:16)) {
df2 <- df1 %>%
mutate(trial_no = case_when(grepl("video", value) ~ vid)) %>%
fill(trial_no)
}
I'm pretty sure there is an easy solution to this.
Any help is very much appreciated.
Use grepl() to flag the rows that start a trial, then take the cumulative sum of the TRUEs:
transform(dat, trial_no=cumsum(grepl('video', value)))
# value trial_no
# 1 videoA 1
# 2 other 1
# 3 videoB 2
# 4 other 2
# 5 other 2
# 6 videoC 3
Data:
dat <- structure(list(value = c("videoA", "other", "videoB", "other",
"other", "videoC")), class = "data.frame", row.names = c(NA,
-6L))
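To see why this works, it may help to look at the intermediate logical vector. Below is a small base R sketch on the same six values; is_start is just an illustrative name.

```r
value <- c("videoA", "other", "videoB", "other", "other", "videoC")

# TRUE on every row that starts a new trial
is_start <- grepl("video", value)

# TRUE counts as 1 and FALSE as 0, so the running total goes up by one
# at each video row and stays constant on the rows in between
trial_no <- cumsum(is_start)

data.frame(value, is_start, trial_no)
```

One caveat: any rows that come before the first video row get trial number 0, because nothing has been counted yet at that point.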

appending values from a look-up table to columns of another data frame based on trailing zero patterns

Data frame dat includes a set of numeric ids in a vector called code_num. Some of these ids end with one or more zeros. Others do not. Here are the first three lines:
code_num X1 X2 X3 … X50
251000 NA NA NA NA
112020 NA NA NA NA
537199 NA NA NA NA
The full data of dat are in the first tab of this google sheet.
Another data frame lut includes another set of numeric ids called code_num_moredetail that need to be associated with the higher-level identifiers in dat. Here are seven example observations of lut:
code_num_moredetail
251000.99
251743.00
251222.02
112020.01
112029.01
537119.00
537119.99
The full data of lut are in the second tab of this google sheet.
The trailing zeros in dat$code_num are wild card digits. Any value of lut$code_num_moredetail that matches the digits preceding the trailing zeros of dat$code_num should be considered a match and needs to be added to the ith row of dat$X1 through dat$X50 (or beyond; I'm not certain how many matches to expect).
Consider two example cases:
if dat$code_num = 999000, then every value of lut$code_num_moredetail that matched the pattern 999###.## would need to be inserted into the columns that begin with the letter X in dat.
if dat$code_num = 999090 then every value of lut$code_num_moredetail that matched the pattern 99909#.## would need to be inserted into the columns that begin with the letter X in dat.
Using only the values provided in the example data frames, the final solution would make dat look like this:
code_num X1 X2 X3
251000 251000.99 251743.00 251222.02
112020 112020.01 112029.01 NA
537199 537119.00 537119.99 NA
I'm seeking an efficient way to augment dat with all wild-card-matched values of lut.
Note: some values of dat$code_num may not match any value of lut$code_num_moredetail - a proper solution must accommodate i matches, where i can range from 0 to 50.
Try
library(dplyr)
library(tidyr)
library(data.table)
library(stringr)
out <- lut %>%
  mutate(new = substr(code_num_moredetail, 1, 3)) %>%
  left_join(dat %>%
              transmute(code_num, new = substr(code_num, 1, 3))) %>%
  mutate(rn = str_c("X", rowid(new))) %>%
  pivot_wider(names_from = rn, values_from = code_num_moredetail) %>%
  select(-new)
-output
out
# A tibble: 3 x 4
code_num X1 X2 X3
<int> <dbl> <dbl> <dbl>
1 251000 251001. 251743 251222.
2 112020 112020. 112029. NA
3 537199 537119 537120. NA
The digits are still in the data; the tibble print method just rounds what it displays:
print(out$X3, digits = 10)
[1] 251222.02 NA NA
Or maybe:
library(fuzzyjoin)
dat1 <- dat %>%
transmute(code_num, new = sub("0+$", "", code_num))
lut$new <- str_replace(sub("\\..*", "", sprintf('%.2f', lut[[1]])),
paste0(".*(", paste(dat1$new, collapse="|"), ").*"), "\\1")
stringdist_left_join(lut, dat1) %>%
select(code_num_moredetail, code_num, new = new.x) %>%
mutate(rn = str_c("X", rowid(new))) %>%
pivot_wider(names_from = rn, values_from = code_num_moredetail) %>%
select(-new)
-output
# A tibble: 3 x 4
code_num X1 X2 X3
<int> <dbl> <dbl> <dbl>
1 251000 251001. 251743 251222.
2 112020 112020. 112029. NA
3 537199 537119 537120. NA
data
lut <- structure(list(code_num_moredetail = c(251000.99, 251743, 251222.02,
112020.01, 112029.01, 537119, 537119.99)), row.names = c(NA,
-7L), class = "data.frame")
dat <- structure(list(code_num = c(251000L, 112020L, 537199L),
X1 = c(NA,
NA, NA), X2 = c(NA, NA, NA), X3 = c(NA, NA, NA)), class = "data.frame",
row.names = c(NA,
-3L))
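For comparison, here is a base R sketch that applies the trailing-zero rule literally: strip the trailing zeros from each code and treat what remains as a required prefix. It assumes every code has the same number of integer digits as the lut values (six in the example), and the names prefix, lut_chr and matches are only illustrative.

```r
dat <- data.frame(code_num = c(251000L, 112020L, 537199L))
lut <- data.frame(code_num_moredetail = c(251000.99, 251743, 251222.02,
                                          112020.01, 112029.01, 537119,
                                          537119.99))

# Fixed part of each code: everything before the trailing zeros.
# sprintf("%d") avoids scientific notation such as "1e+05".
prefix <- sub("0+$", "", sprintf("%d", dat$code_num))

# Text version of the detailed codes, always with two decimals
lut_chr <- sprintf("%.2f", lut$code_num_moredetail)

# For every code, collect the detailed values that start with its prefix
matches <- lapply(prefix, function(p) {
  lut$code_num_moredetail[startsWith(lut_chr, p)]
})

# Pad each result with NA so all rows have the same number of X columns
n_max <- max(lengths(matches), 1L)
wide <- do.call(rbind, lapply(matches, function(m) {
  c(m, rep(NA_real_, n_max - length(m)))
}))
colnames(wide) <- paste0("X", seq_len(n_max))

result <- cbind(dat["code_num"], wide)
result
```

Applied strictly, the rule gives 537199 no matches in the sample lut, because that code has no trailing zeros and none of the lut values start with 537199. The number of X columns grows with the largest number of matches, so it does not need to be known in advance.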

Best way to apply code to 24 similar datasets?

I have 24 datasets that each have one factor and one response. I have written code to subset the 93 entries into 3 categories, but I'm not sure what the most efficient way is to run this code for all 24 of my datasets. Any ideas would be much appreciated.
Here's the data I'm working with.
dput(head(data))
structure(list(run.size.percentage = structure(c(2L, 13L, 24L,
35L, 46L, 57L), .Label = c(",2000,", "1,0.375,0.013", "10,0.868,0.11",
"11,0.953,0.12", "12,1.047,0.12", "13,1.149,0.13", "14,1.261,0.14",
"15,1.385,0.14", "16,1.520,0.15", "17,1.668,0.15", "18,1.832,0.16",
"19,2.011,0.17", "2,0.412,0.023", "20,2.207,0.17", "21,2.423,0.18",
"22,2.660,0.19", "23,2.920,0.20", "24,3.205,0.21", "25,3.519,0.22",
"26,3.863,0.24", "27,4.240,0.25", "28,4.655,0.26", "29,5.110,0.28",
"3,0.452,0.034", "30,5.610,0.30", "31,6.158,0.31", "32,6.760,0.33",
"33,7.421,0.35", "34,8.147,0.37", "35,8.943,0.39", "36,9.817,0.42",
"37,10.78,0.45", "38,11.83,0.47", "39,12.99,0.50", "4,0.496,0.049",
"40,14.26,0.53", "41,15.65,0.56", "42,17.18,0.58", "43,18.86,0.59",
"44,20.70,0.59", "45,22.73,0.58", "46,24.95,0.55", "47,27.39,0.52",
"48,30.07,0.49", "49,33.01,0.46", "5,0.545,0.061", "50,36.24,0.45",
"51,39.78,0.45", "52,43.67,0.45", "53,47.94,0.44", "54,52.62,0.42",
"55,57.77,0.38", "56,63.41,0.35", "57,69.61,0.32", "58,76.42,0.31",
"59,83.89,0.33", "6,0.598,0.072", "60,92.09,0.36", "61,101.1,0.42",
"62,111.0,0.49", "63,121.8,0.59", "64,133.7,0.74", "65,146.8,0.94",
"66,161.2,1.19", "67,176.9,1.49", "68,194.2,1.82", "69,213.2,2.18",
"7,0.656,0.083", "70,234.1,2.55", "71,256.9,2.94", "72,282.1,3.34",
"73,309.6,3.78", "74,339.9,4.25", "75,373.1,4.73", "76,409.6,5.20",
"77,449.7,5.60", "78,493.6,5.87", "79,541.9,5.93", "8,0.721,0.093",
"80,594.9,5.77", "81,653.0,5.37", "82,716.8,4.77", "83,786.9,4.03",
"84,863.9,3.21", "85,948.3,2.36", "86,1041,1.55", "87,1143,0.81",
"88,1255,0.30", "89,1377,0.056", "9,0.791,0.10", "90,1512,0.0044",
"91,1660,0", "92,1822,0"), class = "factor")), row.names = c(NA,
6L), class = "data.frame")
Here's the code that worked for each dataset.
data2 <- tidyr::separate(names(data), unlist(strsplit(names(data), "\\.")), ",", data=data)
group1 <- data2 %>% filter(size <= 2)
group2 <- data2 %>% filter(size > 2 & size <= 50)
group3 <- data2 %>% filter(size > 50 & size <= 2000)
sum(as.numeric(group1$percentage), na.rm=TRUE)
sum(as.numeric(group2$percentage), na.rm=TRUE)
sum(as.numeric(group3$percentage), na.rm=TRUE)
Put your data frames in a list and use lapply(). cut() creates the size groups, and the convert = TRUE argument makes separate() convert the new columns to numeric. Inside the function, each list element is referred to as x:
library(dplyr)
library(tidyr)
df_list <- list(df, df) # creating a dummy list with same df
lapply(df_list, function(x) {
  # split the single column into run, size and percentage
  separate(x, names(x), unlist(strsplit(names(x), "\\.")), ",",
           convert = TRUE) %>%
    group_by(group = cut(size, breaks = c(0, 2, 50, 2000, Inf))) %>%
    summarise(percentage = sum(percentage, na.rm = TRUE))
})
# every list element is your desired output df
[[1]]
# A tibble: 1 x 2
group percentage
<fct> <dbl>
1 (0,2] 0.252
[[2]]
# A tibble: 1 x 2
group percentage
<fct> <dbl>
1 (0,2] 0.252
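If it helps, here is a base R sketch of the same list-plus-lapply pattern that also labels each result with the name of its dataset. The two tiny data frames, their names (site_a, site_b), the group labels and the helper summarise_sizes() are all made up for illustration; the real list would hold your 24 data frames.

```r
# Stand-ins for the real datasets: one column of "run,size,percentage" strings
make_df <- function(rows) {
  data.frame(run.size.percentage = rows, stringsAsFactors = FALSE)
}
datasets <- list(
  site_a = make_df(c("1,0.375,0.013", "20,2.207,0.17", "60,92.09,0.36")),
  site_b = make_df(c("2,0.412,0.023", "30,5.610,0.30", ",2000,"))
)

# Parse one dataset and total the percentage within each size group
summarise_sizes <- function(x) {
  parsed <- read.csv(text = as.character(x[[1]]), header = FALSE,
                     col.names = c("run", "size", "percentage"))
  group <- cut(parsed$size, breaks = c(0, 2, 50, 2000),
               labels = c("<=2", "2-50", "50-2000"))
  totals <- tapply(parsed$percentage, group, sum, na.rm = TRUE)
  data.frame(group = names(totals), percentage = as.vector(totals))
}

# Same function for every dataset, then one combined table
results <- lapply(datasets, summarise_sizes)
combined <- do.call(rbind, Map(cbind, dataset = names(results), results))
combined
```

Keeping the list named means the combined table records which dataset each total came from, which is easier to check than 24 separate printouts.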

Remove row below conditionally in dataframe and add values together in R

I have a large dataset with 3 columns: Name, Country, and Sales.
I'd like to sum the Sales column by Names that are both identical and occur consecutively. Then I'd like to remove all rows but the first occurrence of a series, replacing the value of Sales with the series sum.
For example:
Name,Country,Sales
A,V,100
A,W,100
B,X,100
B,Y,100
A,Z,100
Would be reduced to:
Name,Country,Sales
A,V,200
B,X,200
A,Z,100
Anyone got any idea how to do this?
Your data
df <- structure(list(Name = c("A", "A", "B"), Country = c("X", "Y",
"Z"), Sales = c(100L, 100L, 100L)), .Names = c("Name", "Country",
"Sales"), row.names = c(NA, -3L), class = c("data.table", "data.frame"
))
dplyr solution
library(dplyr)
library(data.table)
ans <- df %>%
group_by(rleid(Name)) %>%
summarise(Name = unique(Name), Sales=sum(Sales)) %>%
select(-1)
Output
Name Sales
<chr> <int>
1 A 200
2 B 100
Alternative example
newdf <- rbind(df, data.frame(Name=c("A","A","B","B"),
Country=c("A","B","C","D"),
Sales=c(100,100,100,100)))
ans <- newdf %>%
group_by(rleid(Name)) %>%
summarise(Name = unique(Name), Sales=sum(Sales)) %>%
select(-1)
Output
Name Sales
<fctr> <dbl>
1 A 200
2 B 100
3 A 200
4 B 200
Here's another solution using sqldf:
library(data.table)
df <- fread("Name,Country,Sales
A,V,100
A,W,100
B,X,100
B,Y,100
A,Z,100")
df$rle = rleid(df$Name)
library(sqldf)
sqldf("select min(rowid) as row_names,
Name,
Country,
sum(Sales) as Sales
from df group by rle", row.names = TRUE)
# Name Country Sales
# 1 A V 200
# 3 B X 200
# 5 A Z 100
row.names = TRUE searches for a column named row_names and treats it as row names, so min(rowid) will not show up as a new column if I set it as row_names.
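Try this:
require(dplyr)
require(data.table) # provides rleid()
df %>%
  group_by(Series = rleid(Name)) %>%
  mutate(Sales = sum(Sales)) %>%
  # keep only the first row of each run of identical names
  filter(row_number() == 1)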
Output:
Name Country Sales Series
1 A V 200 1
2 B X 200 2
3 A Z 100 3
Sample data:
require(data.table)
df <- fread("Name,Country,Sales
A,V,100
A,W,100
B,X,100
B,Y,100
A,Z,100")
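All of the answers above rely on data.table's rleid() to number the runs of consecutive identical names. If you would rather not load data.table, a run id can be sketched in base R with cumsum(); run_id and first_rows are illustrative names.

```r
df <- data.frame(Name = c("A", "A", "B", "B", "A"),
                 Country = c("V", "W", "X", "Y", "Z"),
                 Sales = c(100, 100, 100, 100, 100),
                 stringsAsFactors = FALSE)

# A new run starts on the first row and wherever Name differs from the row above
run_id <- cumsum(c(TRUE, df$Name[-1] != df$Name[-nrow(df)]))

# Keep the first row of each run and replace Sales with the run total
first_rows <- df[!duplicated(run_id), ]
first_rows$Sales <- as.vector(tapply(df$Sales, run_id, sum))

first_rows
```

Because the grouping is on the run id rather than on Name itself, the second block of A rows stays separate from the first.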

create matrix of z-scores in R

I have a data.frame containing survey data on three binary variables. The data is already in a contingency table: the first 3 columns hold the counts of yes answers (1 = yes, 0 = no) and the fourth column holds the total number of answers. The rows are the different groups.
My aim is to calculate z-scores to check whether the proportions differ significantly from the total.
This is my data:
library(dplyr) #loading libraries
df <- structure(list(var1 = c(416, 1300, 479, 417),
var2 = c(265, 925,473, 279),
var3 = c(340, 1013, 344, 284),
totalN = c(1366, 4311,1904, 1233)),
class = "data.frame",
row.names = c(NA, -4L),
.Names = c("var1","var2", "var3", "totalN"))
and these are my total values
dfTotal <- df %>% summarise_all(funs(sum(., na.rm=TRUE)))
dfTotal
dfTotal <- data.frame(dfTotal)
rownames(dfTotal) <- "Total"
to calculate zScore I use the following formula:
zScore <- function (cntA, totA, cntB, totB) {
  #calculate
  avgProportion <- (cntA + cntB) / (totA + totB)
  probA <- cntA/totA
  probB <- cntB/totB
  SE <- sqrt(avgProportion * (1-avgProportion)*(1/totA + 1/totB))
  zScore <- (probA-probB) / SE
  return (zScore)
}
is there a way using dplyr to calculate a 4x3 matrix that holds for all four groups and variables var1 to var3 the z-test-value against the total proportion?
I am currently stuck with this bit of code:
df %>% mutate_all(funs(zScore(., totalN,dftotal$var1,dfTotal$totalN)))
The parameters written here as dftotal$var1 and dfTotal$totalN don't work, and I have no idea how to feed the totals into the formula. The third argument should not always be var1: it should be the total of whichever column (var1, var2, var3 or totalN) is passed as the first argument.
If you want standardized z-scores (each column centered and divided by its standard deviation), use scale():
scale(df)
var1 var2 var3 totalN
[1,] -0.5481814 -0.71592544 -0.4483732 -0.5837722
[2,] 1.4965122 1.42698064 1.4952995 1.4690147
[3,] -0.4024623 -0.04058534 -0.4368209 -0.2087639
[4,] -0.5458684 -0.67046986 -0.6101053 -0.6764787
If you want only the three var columns:
scale(df[,1:3])
var1 var2 var3
[1,] -0.5481814 -0.71592544 -0.4483732
[2,] 1.4965122 1.42698064 1.4952995
[3,] -0.4024623 -0.04058534 -0.4368209
[4,] -0.5458684 -0.67046986 -0.6101053
If you want to use your zScore function inside a dplyr pipeline, we'll need to tidy your data first and add new variables containing the values you now have in dfTotal:
library(dplyr)
library(tidyr)
# add grouping variables we'll need further down
df %>% mutate(group = 1:4) %>%
# reshape data to long format
gather(question,count,-group,-totalN) %>%
# add totals by question to df
group_by(question) %>%
mutate(answers = sum(totalN),
yes = sum(count)) %>%
# calculate z-scores by group against total
group_by(group,question) %>%
summarise(z_score = zScore(count, totalN, yes, answers)) %>%
# spread to wide format
spread(question, z_score)
## A tibble: 4 x 4
# group var1 var2 var3
#* <int> <dbl> <dbl> <dbl>
#1 1 0.6162943 -2.1978303 1.979278
#2 2 0.6125615 -0.7505797 1.311001
#3 3 -3.9106430 2.6607258 -4.232391
#4 4 2.9995381 0.4712734 0.438899

Resources