Apache Pig : Counting frequencies of "character" - bigdata

Assume there is a text file named abalone_data with 3 attributes : name, gender and length
Kn,M,0.89
Un,F,0.77
An,M,0.89
Az,I,0.55
Au,M,0.72
where M is male, F is female and I is infant.
The question is how to count the number of abalone of each gender. The desired output should look like this:
M,3
F,1
I,1
I have used LOAD to load the file, then TOKENIZE to get the gender only, but I am stuck at counting the characters.
abalone_data = LOAD 'hdfs://localhost:9000/pig_data/abalone.data' USING PigStorage(',')
as (name:chararray, gender:chararray, length:float);
abalone_gender = FOREACH abalone_data GENERATE TOKENIZE(gender);

Grouping by gender and applying COUNT to each group should be a safe bet. It's a fairly typical operation.
abalone_data = LOAD 'hdfs://localhost:9000/pig_data/abalone.data' USING PigStorage(',')
as (name:chararray, gender:chararray, length:float);
gender_counted = FOREACH (GROUP abalone_data BY gender)
GENERATE
group AS gender,
COUNT(abalone_data) AS gender_count
;

Related

Is there a way to map or match people's names to religions in R?

I'm working on a paper on electoral politics and tried using this dataset to calculate the share of the electorate that each religion makes up. I created a Christian variable and an if() statement to increase the count of Christians by one whenever a Christian name appears, but it did not work. I would appreciate any help with this.
library(dplyr)
library(ggplot2)
Christian=0
if(Sample...Sheet1$V2=="James"){
Christian=Christian+1
}
The output:
Warning message:
In if (Sample...Sheet1$V2 == "James") { :
the condition has length > 1 and only the first element will be used
Notwithstanding my comment about the fundamental non-validity of this approach, here’s how you would solve this general problem in R:
Generate a lookup table of the different names and categories — this table is independent of your input data:
religion_lookup = tribble(
~ Name, ~ Religion,
'James', 'Christian',
'Christopher', 'Christian',
'Ahmet', 'Muslim',
'Mohammed', 'Muslim',
'Miriam', 'Jewish',
'Tarjinder', 'Sikh'
)
Match your input data against the lookup table (I'm using an input table data with a column Name instead of your Sample...Sheet1$V2):
matched = match(data$Name, religion_lookup$Name)
religion = religion_lookup$Religion[matched]
Count the results:
table(religion)
religion
Christian    Jewish    Muslim      Sikh
        2         5         3         1
Note the lack of ifs and loops in the above.
Christian <- sum( Sample...Sheet1$V2=="James" )
There you go; you don't need the if block.

R binning a categorical age group

I am trying to bin twitterAge into categories and add them as a new column in my data frame, showing the Twitter age group for each row. The groups should look like this:
['0-1', '1-2', '2-3', '3-4', '4-5', '5+']
'0-1' refers to ages equal to or older than 0 and younger than 1. The same logic applies to the other groups: the next would be older than 1 and younger than 2, and so on.
'5+' essentially refers to older than 5 but younger than 6.
My approach is below, but I am afraid it is wrong:
breakpoints <- c(0,1,2,3,4,5,6)
name <- c('0-1','1-2','2-3','3-4','4-5','5+')
# my data frame is named twitter_data
twitter_data$twitterAgeGroup <- cut(twitter_data$twitterAge, breaks = breakpoints, labels = name)
Is this approach suitable?
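One detail worth checking in this approach: by default, cut() builds intervals that are open on the left and closed on the right, so with the breaks above an age of exactly 0 would become NA and an age of exactly 1 would land in '0-1'. The sketch below assumes the intended bins are closed on the left, as the description suggests, and uses right = FALSE to get that. The sample ages are made up for illustration.

```r
# Hypothetical sample data; the real twitter_data is not shown in the question.
twitter_data <- data.frame(twitterAge = c(0, 0.5, 1, 2.7, 4.99, 5, 5.9))

breakpoints <- c(0, 1, 2, 3, 4, 5, 6)
age_labels <- c("0-1", "1-2", "2-3", "3-4", "4-5", "5+")

# right = FALSE makes each bin closed on the left: [0,1), [1,2), ..., [5,6).
twitter_data$twitterAgeGroup <- cut(
  twitter_data$twitterAge,
  breaks = breakpoints,
  labels = age_labels,
  right = FALSE
)

print(twitter_data)
```

With these breaks, ages of 6 or more fall outside every bin and come back as NA. If '5+' should cover everything from 5 upward, replacing the last break with Inf is one option.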

Create a new row to assign M/F to a column based on heading, referencing second table?

I am new to R (and coding in general) and am really stuck on how to approach this problem.
I have a very large data set; columns are sample ID# (~7000 samples) and rows are gene expression (~20,000 genes). Column headings are BIOPSY1-A, BIOPSY1-B, BIOPSY1-C, ..., BIOPSY200-Z. Each number (1-200) is a different patient, and each sample for that patient is a different letter (-A through -Z).
I would like to do some comparisons between samples that came from men and women. Gender is not included in this gene expression table. I have a separate file with patient numbers (BIOPSY1-200) and their gender M/F.
I would like to code something that will look at the column ID (ex: BIOPSY7-A), recognize that it includes "BIOPSY7" (but not == BIOPSY7 because there is BIOPSY7-A through BIOPSY7-Z), find "BIOPSY7" in the reference file, look up M/F, and create a new row with the M/F designation.
Honestly, I am so overwhelmed with coding this that I tried to open the file in Excel to manually input M/F, for the 7000 columns as it would probably be faster. However, the file is so large that Excel crashes when it opens.
Any input or resources that would put me on the right path would be extremely appreciated!!
I don't quite know what your data looks like, so I made mine based on your definitions. I'm sure you can modify this answer based on your needs and your dataset structure:
library(data.table)
genderfile <-data.frame("ID"=c("BIOPSY1", "BIOPSY2", "BIOPSY3", "BIOPSY4", "BIOPSY5"),"Gender"=c("F","M","M","F","M"))
#you can just read in your gender file to r with the line below
#genderfile <- read.csv("~/gender file.csv")
View(genderfile)
df<-matrix(rnorm(45, mean=10, sd=5),nrow=3)
colnames(df)<-c("BIOPSY1-A", "BIOPSY1-B", "BIOPSY1-C", "BIOPSY2-A", "BIOPSY2-B", "BIOPSY2-C","BIOPSY3-A", "BIOPSY3-B", "BIOPSY3-C","BIOPSY4-A", "BIOPSY4-B", "BIOPSY4-C","BIOPSY5-A", "BIOPSY5-B", "BIOPSY5-C")
df<-cbind(Gene=seq(1:3),df)
df<-as.data.frame(df)
#you can read your main df into R with the line below; fread keeps the dashes in the column names from turning into periods (the data.table package must be installed and loaded)
#df<-fread("~/first file.csv")
View(df)
Note that the following line of code removes the dash and letter from the column names of df (I removed the first column by df[,-c(1)] because it is the Gene id):
substr(x=names(df[,-c(1)]),start=1,stop=nchar(names(df[,-c(1)]))-2)
#[1] "BIOPSY1" "BIOPSY1" "BIOPSY1" "BIOPSY2" "BIOPSY2" "BIOPSY2" "BIOPSY3" "BIOPSY3" "BIOPSY3" "BIOPSY4" "BIOPSY4"
#[12] "BIOPSY4" "BIOPSY5" "BIOPSY5" "BIOPSY5"
Now, we are ready to match the columns of df with the ID in genderfile to get the Gender column:
Gender<-genderfile[, "Gender"][match(substr(x=names(df[,-c(1)]),start=1,stop=nchar(names(df[,-c(1)]))-2), genderfile[,"ID"])]
Gender
#[1] F F F M M M M M M F F F M M M
Last step is to add the Gender defined above as a row to the df:
df_withGender<-rbind(c("Gender", as.character(Gender)), df)
View(df_withGender)
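The substr() call above assumes every suffix is exactly two characters long (a dash plus one letter). As a hedged alternative, a regular expression can strip everything from the last dash onward regardless of length. The IDs and genders below are hypothetical and only mirror the shape of the example.

```r
# Hypothetical column names and gender table, mirroring the example above.
sample_ids <- c("BIOPSY1-A", "BIOPSY1-B", "BIOPSY12-C", "BIOPSY200-Z")
genderfile <- data.frame(
  ID = c("BIOPSY1", "BIOPSY12", "BIOPSY200"),
  Gender = c("F", "M", "F"),
  stringsAsFactors = FALSE
)

# Drop everything from the last dash onward to recover the patient ID.
patient_ids <- sub("-[^-]*$", "", sample_ids)

# Look up each patient's gender; IDs missing from genderfile would give NA.
gender <- genderfile$Gender[match(patient_ids, genderfile$ID)]
names(gender) <- sample_ids
print(gender)
```

Keeping the result as a separate named vector, rather than binding it in as a row, leaves the expression columns numeric; rbind() with a character row converts every column to character.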

Extracting a value based on multiple conditions in R

Quick question - I have a dataframe (severity) that looks like,
  industryType                  relfreq  relsev
1 Consumer Products             2.032520 0.419048
2 Biotech/Pharma                0.650407 3.771429
3 Industrial/Construction       1.327913 0.609524
4 Computer Hardware/Electronics 1.571816 2.019048
5 Medical Devices               1.463415 3.028571
6 Software                      0.758808 1.314286
7 Business/Consumer Services    0.623306 0.723810
8 Telecommunications            0.650407 4.247619
if I wanted to pull the relfreq of Medical Devices (row 5) - how could I subset just that value?
I was thinking about just indexing and doing severity$relfreq[[5]], but I'd be using this line in a bigger function where the user would specify the industry i.e.
example <- function(industrytype) {
weight <- relfreq of industrytype parameter
thing2 <- thing1*weight
return(thing2)
}
So if I do subset by an index, is there a way R would know which index corresponds to the industry type specified in the function parameter? Or is it easier/a way to just subset the relfreq column by the industry name?
You first need to select the row of interest and then keep the 2 columns you requested (industryType and relfreq).
The tidyverse packages let you do this intuitively:
library(tidyverse)
data_want <- severity %>%
subset(industryType =="Medical Devices") %>%
select(industryType, relfreq)
Here you read from left to right, with %>% passing the result of each step to the next one, as if the calls were nested.
I think it is better to select the whole row first, then choose the column you would like to see.
frame <- severity[severity$industryType == 'Medical Devices',]
frame$relfreq
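To connect this back to the function in the question, here is a minimal sketch that looks up relfreq by industry name inside the function. The thing1 argument and its default are assumptions, since the question does not say where thing1 comes from, and the check for exactly one matching row is an added safeguard rather than something the question asked for.

```r
# Rebuild the severity table from the question so the sketch is self-contained.
severity <- data.frame(
  industryType = c("Consumer Products", "Biotech/Pharma",
                   "Industrial/Construction", "Computer Hardware/Electronics",
                   "Medical Devices", "Software",
                   "Business/Consumer Services", "Telecommunications"),
  relfreq = c(2.032520, 0.650407, 1.327913, 1.571816,
              1.463415, 0.758808, 0.623306, 0.650407),
  relsev = c(0.419048, 3.771429, 0.609524, 2.019048,
             3.028571, 1.314286, 0.723810, 4.247619),
  stringsAsFactors = FALSE
)

# thing1 is a hypothetical input; the question does not define it.
example <- function(industrytype, thing1 = 1) {
  # Logical indexing picks the relfreq value whose industryType matches.
  weight <- severity$relfreq[severity$industryType == industrytype]
  if (length(weight) != 1) {
    stop("Expected exactly one row for industry type: ", industrytype)
  }
  thing1 * weight
}

example("Medical Devices")
example("Software", thing1 = 10)
```

Subsetting by name this way means the function never needs to know the row index, so it keeps working if the rows are reordered.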

Select variables that contain value in R

I apologize if this question has been answered. I have searched this for way too long.
I have coded data that has a prefix of a letter and suffix of numbers.
ex:
A01, A02,...A99 ### (for each letter A-Z)
I need R code that mirrors this SAS code:
Proc SQL;
Create table NEW as
Select *
From DATA
Where VAR contains 'D';
Quit;
EDIT
Sorry y'all, I'm new! (also, mediocre in R at best.) I thought posting the SAS/SQL code would help make it easier.
Anyway, the data is manufacturing data. I have a variable whose values are the A01...A99, etc. values.
(rough) example of the dataframe:
OBS PRODUCT PRICE PLANT
1 phone 8.55 A87
2 paper 105.97 X67
3 cord .59 D24
4 monitor 98.65 D99
The scale of the data is massive, and I'm only wanting to focus on the observations that come from the plant 'D', so I'm trying to subset the data based on the 'PLANT' variable that contains (or starts with) 'D'. I know how to filter the data with a specific value (ie. ==, >=, != , etc.). I just can't figure out how to do it when only part of the value is known and I have yet to find anything about a 'contains' operator in R. I hope that clarifies things more.
Assuming DATA is your data.frame and VAR is your column value,
DATA <- data.frame(
VAR=apply(expand.grid(LETTERS[1:4], 1:3), 1, paste0, collapse=""),
VAL = runif(3*4)
)
then you can do
subset(DATA, grepl("D", VAR))
A slight alternative to MrFlick's solution: use a vector of row-indices:
DATA[grep('D', DATA$VAR), ]
VAR VAL
4 D1 0.31001091
8 D2 0.71562382
12 D3 0.00981055
where we defined:
DATA <- data.frame(
VAR=apply(expand.grid(LETTERS[1:4], 1:3), 1, paste0, collapse=""),
VAL = runif(3*4)
)
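Note that grepl("D", ...) and grep('D', ...) match a D anywhere in the value. Since the question mentions values that start with 'D', the sketch below shows two ways to anchor the match to the first character. The data frame is a rough copy of the example rows in the question, and the variable names from_plant_d and from_plant_d_regex are made up for illustration.

```r
# Rough copy of the example rows from the question.
DATA <- data.frame(
  OBS = 1:4,
  PRODUCT = c("phone", "paper", "cord", "monitor"),
  PRICE = c(8.55, 105.97, 0.59, 98.65),
  PLANT = c("A87", "X67", "D24", "D99"),
  stringsAsFactors = FALSE
)

# startsWith() needs a character vector, hence as.character() in case
# PLANT was read in as a factor.
from_plant_d <- DATA[startsWith(as.character(DATA$PLANT), "D"), ]

# Equivalent regex form: "^D" anchors the match to the start of the value.
from_plant_d_regex <- DATA[grepl("^D", DATA$PLANT), ]

print(from_plant_d)
```

For codes made of one letter followed by digits, the anchored and unanchored versions select the same rows; anchoring mainly makes the intent explicit.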
