Printing a column if 2 other columns meet a value - R

I have a csv file with multiple columns. I want to print the values of one column if the values of two other columns are above a certain number, and then output this to a plain .txt file. I know how to do this on Linux using awk, but I'm not sure how to do it with R (I am new).
Example data:
Height Age Name
145.2 13 David
170.3 20 Emma
100.1 8 Bob
200.5 23 Ben
176.6 19 Jim
180.7 20 James
165.8 25 Helen
So in this example, I am looking to output the names to a .txt file if Height is greater than or equal to (>=) 170 and Age is also greater than or equal to 19.
So it should output Emma, Ben, Jim and James to a new .txt file.
Any help would be great!
I have tried the code Jean suggested, and I get this sort of output:
Emma, Ben, Jim ... Henry
It only shows some of the names, written across the console rather than as a complete list in one column. I can't see the complete list in R or on Linux.
I have edited my data; in the real thing there are decimal points in the values, which I think is causing the issue. Is there a way around this?

You can print your output to a file this way.
# Redirect console output to a file; x is assumed to be your data frame.
sink(file = "somefile.txt")
# cat() with collapse = "\n" prints one name per line, without print()'s indices:
cat(paste(x[x$Height >= 170 & x$Age >= 19, "Name"], collapse = "\n"))
sink()  # restore output to the console
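An alternative sketch that avoids sink() entirely, assuming x is the data frame read from your csv (the file name below is hypothetical):
x <- read.csv("data.csv")              # hypothetical file name
keep <- x$Height >= 170 & x$Age >= 19  # both conditions must hold
writeLines(as.character(x$Name[keep]), "somefile.txt")
writeLines() writes one element per line, which gives the single-column list your edit asks for. Decimal points in Height are not a problem; numeric comparisons such as >= handle them natively.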

Related

Reading values from a data frame in R

I have been trying to find a way to combine the lines into one:
Code:
Datensatz_LR[29,]
Datensatz_LR[48,]
Datensatz_LR[63,]
Datensatz_LR[100,]
Output:
> Datensatz_LR[29,]
Word
29 Dog
> Datensatz_LR[48,]
Word
48 cat
> Datensatz_LR[63,]
Word
63 land
> Datensatz_LR[100,]
Word
100 shoe
Datensatz is the name of the file that I have imported into R. I can work with it just fine, but I want to have less code and thought that maybe I could write something like this
pseudocode:
Datensatz_LR[29,48,63,100]
desired output of the pseudocode:
Datensatz_LR[29,48,63,100]
Word
29 Dog
48 cat
63 land
100 shoe
Any help would be greatly appreciated.
Would there be a way to make a more elegant version of this?
I apologise for the simple question. I am still relatively new to R.
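A minimal sketch of the usual idiom: wrap the row numbers in c(), so a single bracket call returns all four rows at once.
# One subsetting call instead of four: pass a vector of row indices.
Datensatz_LR[c(29, 48, 63, 100), ]
Because the comma before the closing bracket is kept, the result is still a data frame with the Word column, matching the desired output above.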

Identifying, reviewing, and deduplicating records in R

I'm looking to identify duplicate records in my data set based on multiple columns, review the records, and keep the ones with the most complete data in R. I would like to keep the row(s) associated with each name that have the maximum number of data points populated. In the case of date columns, I would also like to treat invalid dates as missing. My data looks like this:
df <- data.frame(Record = c(1, 2, 3, 4, 5),
                 First = c("Ed", "Sue", "Ed", "Sue", "Ed"),
                 Last = c("Bee", "Cord", "Bee", "Cord", "Bee"),
                 Address = c(123, NA, NA, 456, 789),
                 DOB = c("12/6/1995", "0056/12/5", NA, "12/5/1956", "10/4/1980"))
Record First Last Address DOB
1      Ed    Bee      123 12/6/1995
2      Sue   Cord      NA 0056/12/5
3      Ed    Bee       NA NA
4      Sue   Cord     456 12/5/1956
5      Ed    Bee      789 10/4/1980
So in this case I would keep records 1, 4, and 5. There are approximately 85000 records and 130 variables, so if there is a way to do this systematically, I'd appreciate the help. Also, I'm a total R newbie (as if you couldn't tell), so any explanation is also appreciated. Thanks!
# Add a new column containing the number of NA values in each row.
df$nMissing <- apply(df, MARGIN = 1, FUN = function(x) sum(is.na(x)))
# Using ave(), find the rows with the minimum nMissing value within
# each First/Last name group, and keep only those.
deduped_df <- df[which(df$nMissing == ave(df$nMissing, paste(df$First, df$Last), FUN = min)), ]
# If you like, remove the nMissing column from both data frames.
df$nMissing <- deduped_df$nMissing <- NULL
deduped_df
Record First Last Address DOB
1 1 Ed Bee 123 12/6/1995
4 4 Sue Cord 456 12/5/1956
5 5 Ed Bee 789 10/4/1980
Edit: Per your comment, if you also want to filter on invalid DOBs, you can start by converting the column to date format, which will automatically treat invalid dates as NA (missing data).
df$DOB <- as.Date(df$DOB, format = "%m/%d/%Y")  # invalid dates such as "0056/12/5" become NA
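Then recompute the missing-value counts and re-run the ave() filter, so record 2's invalid date now counts as missing; a minimal sketch reusing the steps above:
df$nMissing <- apply(df, MARGIN = 1, FUN = function(x) sum(is.na(x)))
deduped_df <- df[which(df$nMissing == ave(df$nMissing, paste(df$First, df$Last), FUN = min)), ]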

Merge columns with the same name R

I'm fairly new to R. I'm working with a data set that is incredibly redundant with a lot of columns (~400). There are several duplicate column names, however the data is not duplicate, so I need to sum the columns when collapsing them.
The columns all have a similar name that allows easy identification, so I'm hoping I can use that to my advantage.
I attempted to perform the following:
ColNames <- unique(colnames(df))
CombinedDf <- data.frame(sapply(ColNames, function(i)
  rowSums(df[, colnames(df) == i, drop = FALSE])))
This works if I sum over the range of columns that only contain integers, but the issue is that other columns have strings and such in them, so rowSums throws a fit.
Assuming that the identifier is "XXX", how can I aggregate all the columns that are of the same name leaving the other columns as is?
Thank you for your time.
Edit: Sample data has been asked for, I cannot give the exact data as it is sensitive, but I will give an example:
Name COL1XXX COL2XXX COL1XXX COL3XXX COL2XXX Type
Henry 5 15 25 31 1 Orange
Tom 8 16 12 4 3 Green
Should return
Name COL1XXX COL2XXX COL3XXX Type
Henry 30 16 31 Orange
Tom 20 19 4 Green
I'm not really sure, but you may try transposing the data and then aggregating by unique names.
t_df <- as.data.frame(t(df))
new_df <- aggregate(t_df, by = list(rownames(t_df)), sum)
Again, without sample data I'm unsure if it'll work, but based on what you said, that might work.
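For reference, here is a hedged sketch of the name-based route the question was attempting; the file name is hypothetical, and it assumes the duplicate names survive import (check.names = FALSE) and that every XXX-marked column is numeric:
df <- read.csv("data.csv", check.names = FALSE)  # hypothetical file; keeps duplicate names
xxx <- grepl("XXX", names(df), fixed = TRUE)     # marks the columns to collapse
summed <- sapply(unique(names(df)[xxx]), function(nm)
  rowSums(df[, names(df) == nm, drop = FALSE]))  # sum each duplicated name
CombinedDf <- cbind(df[!xxx], summed)            # Name and Type pass through untouched
The column order differs from the desired output, but the sums match.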

Split a text file according to data into multiple files in Unix

Assume the input file is sorted according to column 3 (the barcode begins with "TCGA"):
Joe 1 TCGA-A8-A08L-01A-11W-A019-09 T
John 2 TCGA-A8-A08L-01A-11W-A019-09 T
Jack 3 TCGA-A8-CVDL-01A-11W-A019-09 T
Jane 4 TCGA-A8-CVDL-01A-11W-A019-09 F
Justin 5 TCGA-A8-E08L-01A-11W-A019-09 F
Jasmine 6 TCGA-A8-E08L-01A-11W-A019-09 T
Jacob 7 TCGA-A8-E08L-01A-11W-A019-09 T
I want to split this text into new files based on the content of the 3rd column, outputting only the 1st column values:
File-1:
Joe
John
File-2:
Jack
Jane
File-3:
Justin
Jasmine
Jacob
How can I achieve this?
Edit: The name of the files can be anything. It is not a problem.
I tried many things such as using split, adding prefix and suffix for each segment, keeping track of previous line, etc. but there is an extremely simple solution that I could not think of at first:
awk -F' ' '{print $1 > $3}' inputfile
Each file will be named after the content of column 3.
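Since the rest of this page is R-focused, a hedged R equivalent of the same split (assuming whitespace-separated input in a file called inputfile):
d <- read.table("inputfile", stringsAsFactors = FALSE)
for (bc in unique(d$V3)) {
  writeLines(d$V1[d$V3 == bc], con = bc)  # one file per barcode, named after it
}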

Read.CSV not working as expected in R

I am stumped. Normally, read.csv works as expected, but I have come across an issue where the behavior is unexpected. It most likely is user error on my part, but any help will be appreciated.
Here is the URL for the file
http://nces.ed.gov/ipeds/datacenter/data/SFA0910.zip
Here is my code to get the file, unzip, and read it in:
URL <- "http://nces.ed.gov/ipeds/datacenter/data/SFA0910.zip"
download.file(URL, destfile = "temp.zip")
unzip("temp.zip")
tmp <- read.table("sfa0910.csv", header = TRUE, stringsAsFactors = FALSE,
                  sep = ",", row.names = NULL)
Here is my problem. When I open the csv data in Excel, the data look as expected. When I read the data into R, the first column is actually named row.names. R is reading in one extra column of data, but I can't figure out where the error occurs that causes row.names to become a column. Simply put, it looks like the data shifted over.
However, what is strange is that the last column in R does appear to contain the proper data.
Here are a few rows from the first few columns:
tmp[1:5,1:7]
row.names UNITID XSCUGRAD SCUGRAD XSCUGFFN SCUGFFN XSCUGFFP
1 100654 R 4496 R 1044 R 23
2 100663 R 10646 R 1496 R 14
3 100690 R 380 R 5 R 1
4 100706 R 6119 R 774 R 13
5 100724 R 4638 R 1209 R 26
Any thoughts on what I could be doing wrong?
My tip: use count.fields() as a quick diagnostic when delimited files do not behave as expected.
First, tabulate the field counts with table():
table(count.fields("sfa0910.csv", sep = ","))
# 451 452
# 1 6852
That tells you that all but one of the lines contains 452 fields. So which is the aberrant line?
which(count.fields("sfa0910.csv", sep = ",") != 452)
# [1] 1
The first line is the problem. On inspection, all lines except the first are terminated by 2 commas.
The question now is: what does that mean? Is there supposed to be an extra field in the header row which was omitted? Or were the 2 commas appended to the other lines in error? It may be best to contact whoever generated the data, if possible, to clarify the ambiguity.
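If you want to see the discrepancy directly, a small sketch that compares the end of the header line with the end of the first data line:
lines <- readLines("sfa0910.csv", n = 2)
substring(lines, nchar(lines) - 5)  # last six characters of each line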
I may have a fix, based on mnel's comments.
dat <- readLines("sfa0910.csv")
# count the commas on each line
ncommas <- sapply(gregexpr(",", dat, fixed = TRUE),
                  function(m) sum(attr(m, "match.length") > 0))
> head(ncommas)
[1] 450 451 451 451 451 451
All lines after the first have an extra separator, which Excel ignores.
# drop the final comma from every line except the header
dat[-1] <- sub("(.*),", "\\1", dat[-1])
write(dat, "temp.csv")
tmp <- read.table("temp.csv", header = TRUE, stringsAsFactors = FALSE, sep = ",")
> tmp[1:5,1:7]
UNITID XSCUGRAD SCUGRAD XSCUGFFN SCUGFFN XSCUGFFP SCUGFFP
1 100654 R 4496 R 1044 R 23
2 100663 R 10646 R 1496 R 14
3 100690 R 380 R 5 R 1
4 100706 R 6119 R 774 R 13
5 100724 R 4638 R 1209 R 26
the moral of the story .... listen to Joshua Ulrich ;)
Quick fix: open the file in Excel and save it. This will also delete the extra separators.
Alternatively
dat <- readLines("sfa0910.csv", n = 1)
dum.names <- unlist(strsplit(dat, ","))
# read with one throwaway column name appended, then drop that last column
tmp <- read.table("sfa0910.csv", header = FALSE, stringsAsFactors = FALSE,
                  col.names = c(dum.names, "XXXX"), sep = ",", skip = 1)
tmp1 <- tmp[, -ncol(tmp)]
I know you've found an answer, but since your answer helped me figure this out, I'll share:
If you read into R a file with a different number of columns in different rows, like this:
1,2,3,4,5
1,2,3,4
1,2,3
it is read in with the missing columns filled with NAs, like this:
1,2,3,4,5
1,2,3,4,NA
1,2,3,NA,NA
BUT!
If the row with the most columns is not the first row, like this:
1,2,3,4
1,2,3,4,5
1,2,3
then it is read in in a rather confusing way:
1,2,3,4
1,2,3,4
5,NA,NA,NA
1,2,3,NA
(overwhelming before you figure out the problem and quite simple after!)
Just hope it may help someone!
If you are using local data, also make sure that it is in the right place. To be sure, put it in your working directory, for instance, and set that directory via
setwd("C:/[User]/[MyFolder]")
directly in your R console.
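Two quick checks from the console, in case the path is the problem:
getwd()       # where R is currently looking for files
list.files()  # confirm the csv actually appears there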
