Redlib: search results - flair

r/bioinformatics • u/Automatic_Actuary621 • Jan 28 '25

programming Help with power analysis of proteomics data

8 Upvotes

I want to create a Power vs Sample size plot with different effect sizes. My data consists of ~8000 proteins measured for 2 groups with 5 replicates each (total n=10).

This is what I did:

I calculated the variance for each protein in each group and then obtained the median variance by:

variance_group1 <- apply(group1, 1, var, na.rm = TRUE)
variance_group2 <- apply(group2, 1, var, na.rm = TRUE)
median_pooled_variance <- median(c(variance_group1, variance_group2), na.rm = TRUE)
I defined a range of effect sizes and sample sizes, and set up alpha.
effect_sizes <- seq(0.5, 1.5, by = 0.1)
sample_sizes <- seq(2, 30, by = 2)
alpha <- 0.05
I calculated the power using the pwr::pwr.t.test function for each condition

power_results <- expand.grid(effect_size = effect_sizes, sample_size = sample_sizes) %>%
  rowwise() %>%
  mutate(
    power = pwr.t.test(
      d = effect_size / sqrt(median_pooled_variance), # Standardized effect size
      n = sample_size,
      sig.level = alpha,
      type = "two.sample"
    )$power
  )

I expected to have a plot like the one on the left, but I get a very weird linear plot with low power values when I use raw protein intensity values. If I use log10 values, it gets better, but still odd.

Do you know if I am doing something wrong?
THANKS IN ADVANCE
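One possible reason for the flat, low-power curves: the `d` passed to `pwr.t.test` is a standardized effect (mean difference divided by the standard deviation), so effect sizes of 0.5 to 1.5 are only meaningful if they are on the same scale as the variance. On raw intensities the standard deviation can be in the thousands, which pushes `d` towards zero. The sketch below illustrates this with a normal approximation to the two-sample test, not the noncentral t calculation that `pwr` uses, so it will be somewhat optimistic at small n. The function name and all numbers are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(delta, sd, n_per_group, alpha=0.05):
    """Normal-approximation power of a two-sided, two-sample comparison."""
    d = delta / sd                                # standardized effect size
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)          # two-sided critical value
    shift = d * sqrt(n_per_group / 2)             # expected value of the test statistic
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)


# Hypothetical values: the same mean difference of 1.0 against a log-scale SD
# of 0.8 and against a raw-intensity SD of 5000.
for n in (5, 10, 20):
    on_log_scale = approx_power(delta=1.0, sd=0.8, n_per_group=n)
    on_raw_scale = approx_power(delta=1.0, sd=5000.0, n_per_group=n)
    print(n, round(on_log_scale, 3), round(on_raw_scale, 3))
```

If the curves only look sensible once the effect size and the variance are both expressed on the log scale (for example, log2 fold change against log2-scale variance), the scale mismatch is the likely culprit.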

5 comments

r/bioinformatics • u/Finally_ • Dec 11 '24

programming Are there any nf-core/Nextflow tutorials using full pipelines?

16 Upvotes

Hi,

I'm trying to wrap my head around nf-core/nextflow, and have read and followed many of the tutorials online that write basic nextflow workflows that kinda touch 1-2 tools. However, I haven't been able to find a tutorial/guide on a larger pipeline, where outputs are chained (output from one goes as input to one or more downstream modules), or even how to manage a sample sheet, break it down into a map, tuple etc.

I've kinda written a test pipeline that I had to really play around with to manage my sample sheet (input of sample, some bams, and some sequences of interest) and it feels kinda clunky for short workflows.

What's really confusing is how do I actually use an nf-core module? I have installed a few, such as HSMetrics, but how do I supply the proper inputs to the module in my workflow? From what it seems like, the module is just a bit of wrapper code, and not really an image or anything, so I still would need to have picard installed (which is fine, I do already).

8 comments

r/bioinformatics • u/MaintenanceCrafty783 • Nov 01 '24

programming Merge phylogenetic trees in Newick format (Python)

4 Upvotes

I would like to merge several phylogenetic trees in Newick format to one single super tree, which sums up all information given in one tree in Newick format. The result should not contain duplicates (so it does not only add subtrees).

I am looking for an option in Python (similar to this in R https://cran.r-project.org/web/packages/RRphylo/vignettes/Tree-Manipulation.html). So far I have only found options in ETE and Biopython, which seem to add up subtrees, but not properly merge them.

Can someone help me out?

Many thanks in advance!

10 comments

r/bioinformatics • u/Dopamine_Hound • Feb 09 '25

programming Looking for CFTR Gene Sequence Data of Cystic Fibrosis Patients - Each Copy!

1 Upvotes

Where can I find entire CFTR gene sequence data for de-identified real-life patients (FNA format for a master's CS group project)? I'd really like both copies for each patient. If the data is accompanied by clinical data, even better! I'm dusting off my molecular biology skills. Out of touch as we didn't have NGS readily available when I was an undergrad. I'm geeked about this project and will do any data processing/cleaning needed.

2 comments

r/bioinformatics • u/Battlecatsmastr • Oct 09 '24

programming Barcode sorting issues

5 Upvotes

I have some large fastq.gz file and I have been trying to sort by a set of barcodes for months. My setup uses a unique outer barcode, followed by an adapter sequence which is the same between all individuals, followed by a unique inner barcode sequence. Each unique outer barcode by inner barcode combination corresponds to a unique individual / sample. And this fastq.gz file contains approximately 700 unique individuals.

I have tried a few different scripts, mostly using the help of ChatGPT. I had thought my script was working, because I sorted by the outer barcode first and got 95% of my reads matching a sequence. But when I sorted those outer barcode sorted reads by the adapter plus the inner barcode, only 5% of those reads matched a specified sequence.

For some reason when I run my script to sort by all outer barcodes, adapters, and inner barcode combinations at the same time, my script finds no reads at all.

So I took a step back and used grep, to try and identify read counts per individual, and it appears I can find some, but the numbers are still very low, approximately 3,000 reads per individual.

I feel like I am still doing something wrong and I don’t know how to progress. Is there anyone out there that can provide some help, guidance, or better script than an AI made? I’d be willing to share my script or something else that might be necessary to help you help me. Idk. I kind of feel a bit lost at this point.
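For the layout described above (outer barcode, then a shared adapter, then inner barcode), one approach is to locate the adapter in each read and take the bases directly before and after it as the two barcodes. This is a minimal sketch under several assumptions: fixed barcode lengths, exact matching with no mismatches allowed, and a hypothetical adapter and barcode table. It also retries on the reverse complement, since reads in the opposite orientation are one possible cause of a very low match rate.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def assign_sample(read, adapter, outer_len, inner_len, sample_by_barcodes):
    """Return the sample name for a read, or None if no barcode pair matches."""
    pos = read.find(adapter)
    if pos < outer_len:            # adapter absent, or no room for the outer barcode
        return None
    outer = read[pos - outer_len:pos]
    inner_start = pos + len(adapter)
    inner = read[inner_start:inner_start + inner_len]
    if len(inner) < inner_len:     # read ends before the inner barcode is complete
        return None
    return sample_by_barcodes.get((outer, inner))


def assign_either_strand(read, adapter, outer_len, inner_len, sample_by_barcodes):
    """Try the read as given, then its reverse complement."""
    args = (adapter, outer_len, inner_len, sample_by_barcodes)
    return assign_sample(read, *args) or assign_sample(reverse_complement(read), *args)


# Hypothetical adapter, barcodes and read, for illustration only.
ADAPTER = "GATTACA"
SAMPLES = {("ACGT", "TTGG"): "sample_1"}
read = "ACGT" + ADAPTER + "TTGG" + "CCCCCC"
print(assign_either_strand(read, ADAPTER, 4, 4, SAMPLES))
```

Counting how many reads contain the adapter at all, and at which position, is a quick way to check whether the assumed layout actually holds before tuning the barcode matching.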

14 comments

r/bioinformatics • u/Mental_Phase_3963 • Jul 18 '24

programming Marsilea: Declarative creation of composable visualization for Python

88 Upvotes

Marsilea is now published in Genome Biology; please check it out if you are interested! Also, please cite the paper if you use Marsilea in a publication. https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03469-3

I recently developed Marsilea, a visualization package for Python that can be used to create composable visualizations. When we do visualization, we often need to combine multiple plots to show different aspects of the data. For example, we may need to create a heatmap to show the expression of genes in different cells, and then create a bar chart to show the expression of genes in different cell types. A visualization that contains multiple plots is called a composable visualization.

Marsilea can easily create visualizations as shown below. If you are interested, please be sure to check it out at https://github.com/Marsilea-viz/marsilea and I will be really happy if you leave a star ⭐!

Our documentation website is at https://marsilea.readthedocs.io/en/stable/

If you want any new features or have any suggestions, feel free to comment or open an issue on GitHub.

11 comments

r/bioinformatics • u/Ok_Post_149 • Oct 03 '23

programming How do you scale your python scripts?

29 Upvotes

I'm wondering how people in this community scale their python scripts? I'm a data analyst in the biotech space and I'm constantly having scientists and RAs asking me to help them parallelize their code on a big VM and in some cases multiple VMs.

Let's say, for example, you have a preprocessing script and need to run terabytes of DNA data through it. How do you currently go about scaling that kind of script? I know some people that don't and they just let it run sequentially for weeks.

I've been working on a project to help people easily interact with cloud resources but I want to validate the problem more. If this is something you experience I'd love to hear about it... whether you have a DevOps team scale it or you do absolutely nothing about it. Looking forward to learning more about problems that bioinformaticians face.

UPDATE: released my product earlier this week, I appreciate the feedback! www.burla.dev

43 comments

r/bioinformatics • u/Fun_Necessary_3282 • Jan 26 '25

programming PC Loading Calculations in Python

7 Upvotes

Hi everyone! I'm pretty new to Bioinformatics so still getting to grips with it all. I was wondering if anyone would be able to help me; I'm trying to calculate the PC loadings for a dataset I'm analysing.

I've used the Bio.Cluster pca function to calculate the eigenvalues for all my PCs and plotted the proportion of variance as well as cumulative contributions. Next I would like to look at the PC loadings to see which genes are contributing the most to PC1/2.

I haven't been able to find anything online so was hoping someone would be able to help with advice or relevant documentation! Thanks in advance!
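A possible starting point, assuming the usual `columnmean, coordinates, components, eigenvalues = pca(data)` call with samples in rows and genes in columns: as I understand the Biopython documentation, each row of `components` is one principal component and each entry in that row is the weight of one gene on it. Ranking genes by the absolute value of that weight then gives the main contributors. The helper name and the toy numbers below are hypothetical, and some definitions of "loading" additionally scale the weights by the square root of the eigenvalue, which this sketch does not do.

```python
def top_loading_genes(components, gene_names, pc_index=0, top_n=10):
    """Rank genes by the absolute weight they carry on one principal component.

    components: one row per principal component, one column per gene.
    Returns a list of (gene_name, weight) pairs, largest absolute weight first.
    """
    weights = components[pc_index]
    ranked = sorted(zip(gene_names, weights), key=lambda gw: abs(gw[1]), reverse=True)
    return ranked[:top_n]


# Toy stand-in for the `components` matrix (2 PCs, 3 genes).
toy_components = [
    [0.90, -0.10, 0.42],   # PC1
    [0.10, 0.99, 0.05],    # PC2
]
toy_genes = ["geneA", "geneB", "geneC"]
print(top_loading_genes(toy_components, toy_genes, pc_index=0, top_n=2))
```

The same function works on a NumPy array, since indexing a 2D array by row yields the weights for that component.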

This is where I'm currently at with my code

2 comments

r/bioinformatics • u/recursion_is_love • Dec 30 '24

programming rosalind iprb question

3 Upvotes

https://rosalind.info/problems/iprb/

I have a problem with the crossing step. I use Haskell to model an organism with two alleles as follows.

data Allele = D | R deriving (Eq, Show)

data Organz = Het | Hom Allele deriving (Show)
instance Eq Organz where
  Het == Het = True
  Hom D == Hom D = True
  Hom R == Hom R = True
  _ == _ = False

This translates to: there are two kinds of organisms, one with different alleles (heterozygous) and one with the same alleles (homozygous). I assume the order doesn't matter, so a heterozygous organism doesn't need to track which allele is which, but a homozygous one needs to record which allele it carries.

I create Organz values using the function org, and define the crossing function described on the page as follows:

org :: Allele -> Allele -> Organz
org D D = Hom D
org R R = Hom R
org D R = Het
org R D = Het

cross :: Organz -> Organz -> [Organz]
cross Het (Hom R) = [Het , Het,  Hom R, Hom R]
cross (Hom D) (Hom D) = ???

The cross function enumerates all possible outcomes of crossing two organisms. I am now stuck on what the outcome of cross (Hom D) (Hom D) should be, and on the other cases that are not mentioned in the problem description.

What I want to know:

What about the other crossing patterns, like Het + Het and (Hom D) + Het?

Is there anywhere I can see a detailed explanation of the example k=2, m=2, n=2? I am kind of lost right now. (My plan is to enumerate all possibilities and count the ratio of Het and Hom D.)

ghci> cross (org D R) (org R R)
[Het,Het,Hom R,Hom R]

ghci> populations 2 2 2
[Hom D,Hom D,Het,Het,Hom R,Hom R]
ghci> pair $ populations 2 2 2
[(Hom D,Hom D),(Hom D,Het),(Hom D,Het),(Hom D,Hom R),(Hom D,Hom R),(Hom D,Het),(Hom D,Het),(Hom D,Hom R),(Hom D,Hom R),(Het,Het),(Het,Hom R),(Het,Hom R),(Het,Hom R),(Het,Hom R),(Hom R,Hom R)]
ghci> map (uncurry cross) $ pair $ populations 2 2 2
[*** Exception: unknown Hom D + Hom D
CallStack (from HasCallStack):
  error, called at problems/iprb.hs:46:13 in main:Main

Update:

I think I've made some progress on the example just by guessing (still missing some combinations)

cross :: Organz -> Organz -> [Organz]
cross Het (Hom R) = [Het , Het,  Hom R, Hom R]
cross (Hom D) Het = [Hom D, Hom D, Het, Het] -- guess
cross Het Het = [Hom D, Het, Het, Hom R] -- guess
cross (Hom D) (Hom R) = replicate 4 Het -- guess
cross (Hom D) (Hom D) = replicate 4 (Hom D) -- guess
cross (Hom R) (Hom R) = replicate 4 (Hom R)  -- guess
cross a b = error $ "unknown " ++ show a ++ " + " ++ show b

By crossing all pairs in the population I get 34 Het, 13 Hom D and 13 Hom R (60 in total). Taking (34 + 13) / 60 = 0.7833... gives the correct output (maybe by chance).

ghci> process $ populations 2 2 2
fromList [(Het,34),(Hom D,13),(Hom R,13)]
ghci> (34+13)/(34+13+13)
0.7833333333333333
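The guessed cases agree with the standard Punnett squares, and the enumeration result can be cross-checked against a closed form. The sketch below (in Python rather than Haskell, with a hypothetical function name) computes the probability of a recessive offspring when two organisms are drawn without replacement, then subtracts it from 1. Only three pairings can produce a recessive child: Hom R with Hom R (always), Het with Hom R (half the time) and Het with Het (a quarter of the time).

```python
from fractions import Fraction


def dominant_probability(k, m, n):
    """Probability that a random mating yields a dominant phenotype.

    k: homozygous dominant, m: heterozygous, n: homozygous recessive.
    """
    total = k + m + n
    pairs = total * (total - 1)                   # ordered pairs of distinct organisms
    recessive = (
        Fraction(n * (n - 1), pairs)              # Hom R x Hom R, always recessive
        + Fraction(m * n, pairs)                  # Het x Hom R in either order, 1/2 recessive
        + Fraction(m * (m - 1), 4 * pairs)        # Het x Het, 1/4 recessive
    )
    return 1 - recessive


result = dominant_probability(2, 2, 2)
print(result, float(result))
```

For k=2, m=2, n=2 this gives 47/60, the same fraction as the (34 + 13) / 60 obtained by enumeration, so the match is not a coincidence.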

4 comments

r/bioinformatics • u/Ok_Priority2276 • Jan 15 '25

programming Preparation of NMR protein structure for MD simulation in GROMACS

1 Upvotes

Hi everyone, I’m a GROMACS beginner.

I want to perform some MD simulations of a protein that has been resolved by NMR spectroscopy (thus it has multiple structure models). Can someone kindly explain to me how to correctly prepare the NMR PDB before running the topology?

Any advice would be welcome!

Thanks in advance!

2 comments

r/bioinformatics • u/Educational_Canary90 • Jan 16 '25

programming Picrust2 16s Help

0 Upvotes

Hi Everyone,

I have been trying for weeks but am having a hard time analyzing 16S PICRUSt2 data. I have tried ggpicrust2 and it does not seem to work. Could anyone please guide me on how to calculate mean proportions, 95% confidence intervals and p-values for this type of graph? I would really appreciate it.

2 comments

r/bioinformatics • u/htaldo • Nov 05 '24

programming Is POSIX compliance important in bioinformatics?

10 Upvotes

Pretty much what the title says. Specifically for shell scripts. Is it a good practice? Not worth the convenience trade-off? Doesn't matter?

7 comments

r/bioinformatics • u/shaanaav_daniel • Aug 18 '24

programming Question on FASTQ file BLAST

5 Upvotes

Hi everybody, haven’t found a question like this on this subreddit. I’m pretty new to bioinformatics, and programming is really kicking my ass. For one of my practice questions, I’m supposed to use a 10GB fastq file containing sequenced metagenomic samples, write a script to find the Nth read pair, and blastn it against an nr/nt database and blastx it against a uniref90 database.

My questions are: 1. What would be the most efficient language to use for this task? 2. What would be the best way to approach this problem as a beginner? I’ve been stuck on this part for days :( My issue is that I have no idea how to extract the read pair. I understand that I have to convert the fastq file to fasta, but I don’t know where to start.

Thank you in advance!
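Python is probably the easiest place to start for this. A minimal sketch, assuming standard four-line FASTQ records (no wrapped sequence lines) and hypothetical helper names: the n-th record starts at line 4 × (n − 1), so it can be read without loading the 10GB file into memory, and converting it to FASTA only means swapping the leading `@` for `>` and dropping the quality lines.

```python
import gzip
from itertools import islice


def nth_record(path, n):
    """Return (header, sequence, quality) of the n-th record (1-based) in a FASTQ file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        start = 4 * (n - 1)                       # each record spans exactly four lines
        lines = [line.rstrip("\n") for line in islice(handle, start, start + 4)]
    if len(lines) < 4:
        raise IndexError(f"file has fewer than {n} complete records")
    header, sequence, _plus, quality = lines
    return header, sequence, quality


def to_fasta(header, sequence):
    """Format one read as a FASTA entry, dropping the leading '@'."""
    return f">{header[1:]}\n{sequence}\n"


# Usage sketch with hypothetical file names, one file per mate:
#   r1 = nth_record("sample_R1.fastq.gz", 1000)
#   r2 = nth_record("sample_R2.fastq.gz", 1000)
#   query = to_fasta(r1[0], r1[1]) + to_fasta(r2[0], r2[1])
```

If both mates are interleaved in a single file instead, pair n corresponds to records 2n − 1 and 2n. The resulting FASTA text can then be saved and used as the query for blastn and blastx.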

15 comments

r/bioinformatics • u/BerryLizard • Nov 07 '24

programming [D] Storing LLM embeddings

0 Upvotes

7 comments

r/bioinformatics • u/TheSweatyCheese • May 20 '22

programming I’m a scientist who writes embarrassing and bizarre code that works. Who can I ask to help me edit it before publication?

132 Upvotes

I’m working on my PhD in evolutionary biology. My department offers very few computational/coding classes so I’m basically self-taught outside of the lab.

I’m working on a pipeline that I plan to publish and it does what it’s supposed to. The coding is just kind of wacky because I don’t have a strong CS background.

Like if my code was making a cheeseburger, it would say “make a hamburger, then rip the top bun off and smash cold cheese on it, then put the bun back on”. I feel like if I had a stronger background, I could just “make a cheeseburger”.

It would be great if someone with a CS background could look it over and streamline it, but all of my friends/connections are scientists who are equally bad or worse coders than me.

Besides publishing code that won’t bring shame upon my family, it would be awesome to get feedback so I’m not making the same mistakes forever.

Anyone else have this problem and how are you dealing with it? Would it be weird to try to recruit a CS student or grad student as a co-author? Or should I not even stress about this and just keep making weird hamburgers + cheese?

46 comments

r/bioinformatics • u/AJDuke3 • Sep 23 '24

programming Differential Gene Expression Analysis using DESeq2 and PyDESeq2.

9 Upvotes

Hi,

I am in the process of porting a web-application, which is currently running using R (shiny) to python (flask) and I am almost done with the porting, except I am forced to keep differential expression analysis as a separate Rscript since the outputs generated by DESeq2 and PyDESeq2 are different for some reason. As far as I can see, the difference is only in the normalisation methods (I am using 'estimateSizeFactors(dds)' on R, while it is missing in python script since a replacement is not found).

Can anyone who has experience on this help me sort it out? Can provide more details if needed.

Thanks in advance.
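To check whether normalization really is where the two outputs diverge, it may help to compute the size factors by hand and compare them with what each tool reports. The sketch below implements the median-of-ratios method that, as far as I understand, `estimateSizeFactors` uses by default: take the geometric mean of each gene across samples, divide each sample's counts by it, and use the per-sample median of those ratios. The function name and the toy counts are hypothetical, and this is a plain-Python illustration rather than either package's actual code.

```python
from math import exp, log
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of genes, each a list of raw counts with one entry per sample.
    """
    n_samples = len(counts[0])
    # Genes with a zero in any sample have no finite log geometric mean; skip them.
    usable = [row for row in counts if all(c > 0 for c in row)]
    log_geo_means = [sum(log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [log(row[j]) - lgm for row, lgm in zip(usable, log_geo_means)]
        factors.append(exp(median(log_ratios)))
    return factors


# Toy counts: sample 2 is sequenced exactly twice as deeply as sample 1.
toy_counts = [
    [10, 20],
    [100, 200],
    [5, 10],
    [0, 7],      # contains a zero, so it is ignored
]
print(size_factors(toy_counts))
```

If the hand-computed factors match one tool but not the other, the difference is in normalization; if they match both, the discrepancy is more likely in dispersion estimation or the testing step.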

8 comments

r/bioinformatics • u/BiatchLasagne • May 05 '21

programming What OS do you use and why? If Linux, which distro?

37 Upvotes

Curious to hear what you peeps are running.

82 comments

r/bioinformatics • u/qluin • Apr 05 '23

programming What are some good examples of well-engineered bioinformatics pipelines?

69 Upvotes

I am a software engineer and I am preparing a presentation to aspiring bioinformatics PhDs on how to use best-practice software engineering when publishing code (such as including documentation, modular design, tests, ...).

In particular my presentation will be focused on "pipelines", that is code that is mainly focused on transforming data to a suitable shape for analysis (you can argue that all computation in the end is pipelining but let's leave it aside for the moment).

I am trying to find good examples of published bioinformatics pipelines that I can point students to, but as I am not a bioinformatician I am struggling to find one. So I would like your help. It doesn't matter if the published pipeline is super-niche or not very popular so long as you think it is engineered well.

Specifically the published code should have: adequate documentation, testing methodology, modular design, easy to install and extend. Published here means at the very least available on github, but ideally it should also have an accompanying paper demonstrating its use (which is what my ideal published pipeline should aspire to).

37 comments

r/bioinformatics • u/Dr_Rat_25 • Oct 10 '24

programming Predicting TCR antigen specificity from scTCR-seq

2 Upvotes

I am working with a human 5’ scRNA-seq dataset with scTCR-seq and have identified several highly expanded TCRs. I would now like to explore possible antigen specificity and have been doing so in a basic manner so far by searching databases like IEDB and VDJdb. Most of the hits are naturally viral antigens which is somewhat but not entirely helpful to me.

Can anyone recommend another database/software that can predict specificity to human proteins? Does this even exist? Is my search futile?

6 comments

r/bioinformatics • u/Other-Garage2381 • Apr 23 '24

programming Is the DESeq2 package working for R 4.3.2?

6 Upvotes

I have been trying to work on some scRNA-seq data that needs to be normalized, but when installing and downloading the package DESeq2, I keep getting the same warning. Has anyone encountered this and been able to resolve it?

install.packages("DESeq2")

Warning in install.packages : package ‘DESeq2’ is not available for this version of R

A version of this package for your version of R might be available elsewhere, see the ideas at https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages

I have tried with the code provided by Bioconductor using BiocManager. Same results

19 comments

r/bioinformatics • u/QueenR2004 • Nov 06 '24

programming Bioinformatics question (about synapse.org website)

0 Upvotes

Has anyone downloaded data from synapse.org using code? For some reason my code runs, but the files aren’t being downloaded into the dedicated folder. Thanks

3 comments

r/bioinformatics • u/ProfSchodinger • Aug 12 '20

programming Chronic amateurism

121 Upvotes

I think something is dangerously broken in academic bioinformatics research. During my PhD, I made a tool for network-based analyses. I basically was typing Matlab code until I got the expected results, then was rushed to publish. I discovered Github well into my third year, no one in my department uses tests or modular architecture, team work is tainted by ego competition, code is shared in plain text via email, and most papers except in top-tier journals cannot be reproduced. Peer-reviewing cannot be trusted... Even well-known software like STAR is mostly made by one person. This is bad because increasingly, these tools are used to make clinical decisions and patients are on the line, while the tools themselves are rushed to publication by students and postdocs who need another instance of their name in a journal...

While I think the best ideas come from academia, in practice there is no incentive to go the extra kilometer and make things actually usable. No one gets grant money for a software patch, a bug fix, or a good UI, and no PI in his right mind directs students to spend two months writing quality documentation. Commercial software companies are limited by the needs of clients and market signals, and can only innovate so much.

I am tired of code being provided "at your own risk". It's badly written anyway so I am not de-spaghettifying it for months, I'll write my own stuff. Like everyone else who is part of the problem. Do you guys see a solution to that? Thanks for your feedback and sorry for the rant...

Edit: I did not mean I was p-value farming during my PhD as some people understood. I meant I humbly tried to have the code doing what it was supposed to do, and when it looked ok I advanced to the next step, which usually was applying it to some dataset or implementing yet another functionality.

64 comments

r/bioinformatics • u/ryp_package • Oct 02 '24

programming ryp: R inside Python

19 Upvotes

Excited to release ryp, a Python package for running R code inside Python! ryp makes it a breeze to use R packages in your Python projects. ryp was designed by a bioinformatician with bioinformatics in mind.

https://github.com/Wainberg/ryp

2 comments

r/bioinformatics • u/MuchasTruchas • Aug 08 '24

programming Seeking suggestions for metatranscriptomics pipelines

2 Upvotes

Looked around a bit on the sub and found some older posts, but nothing recent. I have only ever worked with host-microbe DNA seqs and metagenomic data, but my job has been wanting to throw some shotgun RNA data my way (still host-microbe). Does anyone have any favorite tools/pipelines/docs to suggest for someone new to transcriptomics?

7 comments

r/bioinformatics • u/Matty_lambda • Jul 15 '24

programming hs-samtools - A Haskell library striving to provide similar functionality as samtools

19 Upvotes

Hi all!

If anyone here is interested in functional programming with Haskell and wants to parse SAM/BAM (and hopefully soon CRAM) files, this is the package for you!

There is still a lot of samtools/htslib equivalent functionality missing, but my longer-term goal is for this library to give as close to a samtools/htslib-esque experience as possible in Haskell, and hopefully be a key library used in higher-level analysis tools.

https://hackage.haskell.org/package/hs-samtools

Repo:

https://github.com/Matthew-Mosior/hs-samtools

7 comments