I'm very new to R and can't get the hang of using pipes for trivial commands. How can I rewrite these working commands using pipes instead? The following two problems are not related.
1) I'm trying to remove duplicates from my dataframe and replace the old dataframe with a new one that has no duplicated values.
2) I'm trying to change factor format to date format.
1) df <- df[!duplicated(df),]
2) df$date_col <- anytime(df$date_col,
useR = getOption("anytimeUseRConversions", FALSE),
oldHeuristic = getOption("anytimeOldHeuristic", FALSE))
Here is one option
library(dplyr)
library(anytime)
# Assign the result back so the old data frame is replaced
df <- df %>%
  distinct() %>%
  mutate(date_col = anytime(date_col))
I am fairly new to R and data manipulation, and I am trying to transpose the UCI words dataset. The default dataset is currently structured as follows.
Where the first column is the document number, the second column is the word number referencing another text file and the last column is the number of times the word occurs in the document. (For now, we can forget about the third column and I know how to drop it from the dataset.)
What I am trying to do is to transpose the dataset so that I can have each document's words in one row. So a simple example would be like this.
I tried using the t() function, but it transposes the entire dataset altogether, which is not what I want. I looked into using the dplyr package to help with the data manipulation, but I am not getting any solid leads. If you have any sources or a particular direction you can nudge me towards, that would be helpful.
Thank you!
Here's a solution using the tidyverse package (which includes dplyr). The trick is to first add another column to differentiate entries with the same value in the first column (document number) and then just change the data to wide format using pivot_wider.
library(tidyverse)
# Your data
df <- read.csv(text = "num word
1 61
2 76
1 89
3 211
3 296", sep = " ")
df %>%
  # Group by num
  group_by(num) %>%
  # Add a row number to differentiate entries for the same first column value
  mutate(rownum = row_number()) %>%
  # Change data to wide format (the argument is id_cols, not id)
  pivot_wider(id_cols = num,
              names_from = rownum,
              values_from = word)
So I was able to figure out how to accomplish this task. Hopefully, it helps other DS's in the future.
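library(dplyr)
data <- read.table("docword.kos.txt", sep = " ")
# Keep only the document number and word number
data <- data %>% select(V1, V2)
# Collapse each document's word numbers into one comma-separated string
trans <- data %>%
  group_by(V1) %>%
  summarise(words = paste(V2, collapse = ","))
trans <- trans %>% select(words)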
What I ended up doing is using the dplyr package to perform some data wrangling and group my dataset by the first column. Then I exported the dataset and re-imported it after making some slight adjustments in Notepad (removing the " characters from the generated csv file).
write.csv(trans, "~/trend.csv", row.names = FALSE)
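As a small illustration of the collapse step above, here is a sketch on made-up data shaped like the first two columns of the file. The values and the temporary output path are assumptions, not the real dataset. Writing the strings with writeLines() is one possible way to avoid the quoted output that otherwise has to be cleaned up by hand.

```r
library(dplyr)

# Toy stand-in for the first two columns of docword.kos.txt (made-up values)
data <- data.frame(V1 = c(1, 2, 1, 3, 3),
                   V2 = c(61, 76, 89, 211, 296))

# One row per document, with its word IDs pasted into a single string
trans <- data %>%
  group_by(V1) %>%
  summarise(words = paste(V2, collapse = ","))

# Write one document per line, without surrounding quotes
out_file <- tempfile(fileext = ".csv")  # hypothetical output location
writeLines(trans$words, out_file)
```

Each line of the output file then holds the word IDs of one document, in document order.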
I am trying to convert the following block of code written in R to Python:
df <- df %>%
  group_by(column_1) %>%
  mutate(new_col1 = length(which(column_x < 1)),
         new_col2 = new_col1 / counter)
df: is a dataframe
My attempt to do this in Python is the following block:
df = df.groupby(['column_1']).apply(
new_col1=len(df[df['column_x']] < 1)),
new_col2= df['new_col1'] / num_samples)
But I am getting the following error:
raise KeyError(f"None of [{key}] are in the [{axis_name}]")
Note that column new_col2 needs new_col1 to be created and so I couldn't find a way to combine the operation of creating two columns with custom behavior and group them by a single column from the data frame.
How would I be able to convert the above R block of code into working Python code using pandas?
Thanks a lot in advance,
You can do this with transform:
df['new_col1'] = (df['column_x'] < 1).groupby(df['column_1']).transform('sum')
df['new_col2'] = df['new_col1']/num_samples
Here dplyr::mutate corresponds to pandas transform, but transform only accepts a single-column calculation, so new_col2 is created in a separate step.
I am trying to compute some values with group_by and add the results back to the data frame as new variables using mutate, but somehow it is not working.
I have found many posts about the same problem on the forum and tried the ones that seemed relevant, such as using the <- assignment operator or changing the pipe operator to %<>%, but nothing has worked so far.
Here is my code. Please tell me what I am doing wrong.
library(dplyr)
library(lubridate)
df3 %>%
group_by(Day = day(ymd_hms(timestamp))) %>%
mutate(pressure_m = mean(pressure)) %>%
mutate(pressure_s = sum(pressure))
I want pressure_m and pressure_s to be in the original data frame. They only show up when I run the above code; they are not in the data frame in my environment.
Try this:
library(dplyr)
library(lubridate)
df3 %>%
group_by(Day = day(ymd_hms(timestamp))) %>%
mutate(
pressure_m = mean(pressure),
pressure_s = sum(pressure)
) -> df3
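The reason the original attempt appeared not to work is that a pipeline returns a new data frame rather than modifying df3 in place, so the result has to be assigned to a name. A minimal sketch with made-up data (the timestamps and pressures are assumptions), using the more common leftward assignment and an ungroup() at the end so that later operations are not silently grouped:

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data standing in for df3
df3 <- data.frame(
  timestamp = c("2020-01-01 10:00:00", "2020-01-01 12:00:00",
                "2020-01-02 09:00:00"),
  pressure  = c(1000, 1010, 990)
)

# Assign the pipeline's result back to df3 to keep the new columns
df3 <- df3 %>%
  group_by(Day = day(ymd_hms(timestamp))) %>%
  mutate(pressure_m = mean(pressure),
         pressure_s = sum(pressure)) %>%
  ungroup()
```

Whether to use `df3 <- df3 %>% ...` or `... -> df3` is a matter of taste; both store the result.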
I have an HMMSCAN result file of protein domains with 10 columns. Please see the link for the CSV file.
https://docs.google.com/spreadsheets/d/10d_YQwD41uj0q5pKinIo7wElhDj3BqilwWxThfIg75s/edit?usp=sharing
But I want it to look like this:-
1BVN:P|PDBID|CHAIN|SEQUENCE Alpha-amylase Alpha-amylase_C A_amylase_inhib
3EF3:A|PDBID|CHAIN|SEQUENCE Cutinase
3IP8:A|PDBID|CHAIN|SEQUENCE Amdase
4Q1U:A|PDBID|CHAIN|SEQUENCE Arylesterase
4ROT:A|PDBID|CHAIN|SEQUENCE Esterase
5XJH:A|PDBID|CHAIN|SEQUENCE DLH
6QG9:A|PDBID|CHAIN|SEQUENCE Tannase
The repeated entries of column 3 should be grouped, and the corresponding values of column 1, which are in different rows, should be arranged in separate columns.
This is what I have written so far:
df <- read.csv ("hydrolase_sorted.txt" , header = FALSE, sep ="\t")
new <- df %>% select (V1,V3) %>% group_by(V3) %>% spread(V1, V3)
I hope I am clear with the problem statement. Thanks in advance!!
Your input data set has some irregular rows. Your approach is on the right track, but one more step is required:
library(dplyr)
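df %>%
  select(V3, V1) %>%
  group_by(V3) %>%
  # Paste all V1 values of a group into one space-separated string
  mutate(x = paste(V1, collapse = " ")) %>%
  select(V3, x) %>%
  # Keep one row per V3 instead of one per original row
  distinct()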
What we did here is simply concatenate the strings by V3. Before running the code above, you should preprocess the data and fix the improper rows manually (the rows for TIM, Tannase, and DLH). To do that, you can use the Text to Columns feature in Excel.
The required steps are shown below, with the problematic columns highlighted in yellow:
Sorry for the non-English interface of my Excel, but the procedure is self-explanatory.
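If the domains are wanted in separate columns, as in the desired output in the question, one possible sketch is to number the domains within each sequence and then widen the data with tidyr::pivot_wider(). The toy rows below are taken from the desired output shown in the question; the column name prefix domain_ and the helper column k are assumptions made for illustration.

```r
library(dplyr)
library(tidyr)

# Toy rows: V1 is the domain name, V3 is the sequence identifier
df <- data.frame(
  V1 = c("Alpha-amylase", "Alpha-amylase_C", "A_amylase_inhib", "Cutinase"),
  V3 = c(rep("1BVN:P|PDBID|CHAIN|SEQUENCE", 3), "3EF3:A|PDBID|CHAIN|SEQUENCE")
)

wide <- df %>%
  select(V3, V1) %>%
  group_by(V3) %>%
  # Number the domains within each sequence: 1, 2, 3, ...
  mutate(k = row_number()) %>%
  ungroup() %>%
  # One row per sequence, one column per domain position
  pivot_wider(id_cols = V3, names_from = k, values_from = V1,
              names_prefix = "domain_")
```

Sequences with fewer domains than the maximum get NA in the unused columns.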
This is my first stackoverflow question.
I'm trying to use dplyr to process and output a summary of data grouped by a categorical variable (inj_length_cat3) in my dataset. Actually, I generate this variable (from inj_length) on the fly using mutate(). I also want to output the same summary of the data without grouping. The only way I figured out how to do that is to do the analysis twice over, once with, once without grouping, and then combine the outputs. Ugh.
I'm sure there is a more elegant solution than this and it bugs me. I wonder if anyone would be able to help.
Thanks!
library(dplyr)
df<-data.frame(year=sample(c(2005,2006),20,replace=T),inj_length=sample(1:10,20,replace=T),hiv_status=sample(0:1,20,replace=T))
tmp <- df %>%
mutate(inj_length_cat3 = cut(inj_length, breaks=c(0,3,100), labels = c('<3 years','>3 years')))%>%
group_by(year,inj_length_cat3)%>%
summarise(
r=sum(hiv_status,na.rm=T),
n=length(hiv_status),
p=prop.test(r,n)$estimate,
cilow=prop.test(r,n)$conf.int[1],
cihigh=prop.test(r,n)$conf.int[2]
) %>%
filter(inj_length_cat3%in%c('<3 years','>3 years'))
tmp_all <- df %>%
group_by(year)%>%
summarise(
r=sum(hiv_status,na.rm=T),
n=length(hiv_status),
p=prop.test(r,n)$estimate,
cilow=prop.test(r,n)$conf.int[1],
cihigh=prop.test(r,n)$conf.int[2]
)
tmp_all$inj_length_cat3=as.factor('All')
tmp<-merge(tmp_all,tmp,all=T)
I'm not sure you consider this more elegant, but you can get a solution to work if you first create a dataframe that has all your data twice: once so that you can get the subgroups and once to get the overall summary:
df1 <- rbind(df,df)
# The extra Inf break only exists to add 'All' as a third factor level
df1$inj_length_cat3 <- cut(df1$inj_length, breaks=c(0,3,100,Inf),
                           labels = c('<3 years','>3 years','All'))
df1$inj_length_cat3[-(1:nrow(df))] <- "All"
Now you just need to run your first analysis without mutate():
tmp <- df1 %>%
group_by(year,inj_length_cat3)%>%
summarise(
r=sum(hiv_status,na.rm=T),
n=length(hiv_status),
p=prop.test(r,n)$estimate,
cilow=prop.test(r,n)$conf.int[1],
cihigh=prop.test(r,n)$conf.int[2]
) %>%
filter(inj_length_cat3%in%c('<3 years','>3 years','All'))
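The same idea can also be written as one pipeline with dplyr::bind_rows(). This is only a sketch under a few assumptions: the seed is arbitrary, the intermediate name labelled is made up, and the confidence intervals are left out for brevity (p here is the plain proportion r / n).

```r
library(dplyr)

set.seed(1)  # arbitrary seed so the toy data is reproducible
df <- data.frame(year = sample(c(2005, 2006), 20, replace = TRUE),
                 inj_length = sample(1:10, 20, replace = TRUE),
                 hiv_status = sample(0:1, 20, replace = TRUE))

# Label each row with its subgroup (as character, so "All" can be added)
labelled <- df %>%
  mutate(inj_length_cat3 = as.character(
    cut(inj_length, breaks = c(0, 3, 100),
        labels = c("<3 years", ">3 years"))))

# Stack a second copy labelled "All", then summarise once
tmp <- bind_rows(labelled, mutate(df, inj_length_cat3 = "All")) %>%
  group_by(year, inj_length_cat3) %>%
  summarise(r = sum(hiv_status, na.rm = TRUE),
            n = n(),
            p = r / n,
            .groups = "drop")
```

Because every row appears once under its subgroup and once under "All", the summary only has to be written a single time.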